Nonlinear Dynamical Analysis and 
Predictive Coding of Speech 


A Thesis Submitted 

in Partial Fulfilment of the Requirements 
for the Degree of 

DOCTOR OF PHILOSOPHY 


ARUN KUMAR 


to the 

DEPARTMENT OF ELECTRICAL ENGINEERING 

INDIAN INSTITUTE OF TECHNOLOGY KANPUR 

December, 1994. 



Certificate 



It IS certified that the work contained in the thesis enhtled "Nonlinear Dynamical 
\nalysis and Predichve Coding of Speech", by Arun Kumar, has been carried out 
under my supervision and that this work has not been submitted elsewhere for a 
degree. 



S K Mulhck 
Professor 
E E Department 
I I T, Kanpur 

December, 1994 


111 



c: 3 jui ic: 




Bh- I9Q4- D- K/OK/ 




7 » 





Acknowledgements 


I wish to express my deep sense of gratitude to my supervisor Professor S K. Mullick 
for his guidance, advice and encouragement The various values that I tried to learn 
from him shall remain a source of inspiration for me forever I owe it to my teachers, 
particularly Professors R Sharan, PR K Rao, VP Smha, M U Siddiqui, G C Ray, 
R K Bansal, Mrs S Gupta, G Sharma and A Mahanta for the deep insights given 
through the various courses they taught, their advice, encouragement and interest 
in our work through the years 

Professors J K Bhattacharjee and K Banerjee of the Physics department have 
regularly offered a course on Chaos Theory since the early stages of development of 
the subject itself I had credited the course during my Undergraduate study Their 
foresight and lucid exposihon of the subject helped me to embark on this topic of 
research 

I would like to acknowledge the help of many persons during the progress of 
the research work First and foremost, I would hke to thank Dr Rakesh Mullick who 
saw to it that our research work was never delayed by promptly mailing numerous 
reference papers and other study material whenever 1 needed them 1 would also 
like to thank Dr P V Dhamija of the Central Institute of English and Foreign 
Languages, Hyderabad and Dr. K. Sruam and his colleagues at Bell Laboratories, 
USA, for providing the two speech databases used throughout this work. Dr S 
Prasad of 1 1 T. Delhi for providing CELP decoded speech data and Dr Preeti Rao 
for helping in evaluating the quality of decoded speech and makmg suggestions for 
the codec studied by us 

I have benefitted greatly from the interactions with many research scholars and 
the cooperation extended by them whenever required For this, I would like to thank 
Mr Ajit Chaturvedi, Abhay Karandikar, Venkatesh, Deepak Murthy, Chaitanya Babu, 
Jitendra Das, Sudipto Mukhopadhyaya, Puranjoy Bhattacharya, Balvinder Singh and 



vili 


Acknowledgements 


Apu Sivadas I will specially cherish my association with fellow research scholar Mr 
Samarendra Dandapat who has been a constant companion in this endeavour 

Finally, I thank everyone who made this work possible and the experience 
enjoyable 



Synopsis 


Speech signal coding has been has been an achve field of research tor over a 
couple of decades and continues to be so in spite of the increasing proliferation 
of optical transmission media of relatively unlimited bandwidth This is because 
of the continued and in fact mcreasing use of bandlmuted media such as satellite 
lines and radio channels and bit limited storage media such as CD-ROMs Also, 
the applications of speech codmg have become numerous in recent times A major 
effort has been given in the last ten years to the development of a class of analysis 
-by - synthesis coding schemes for low and medium bit rate speech coding Most 
medium and low bit rate speech coders are based on the speech production model 
of a time varying hnear filter excited by a source Such coders are usually designed 
to estimate the linear filter coefficients and the excitation sequence in a frame - by 
- frame manner such that the output of the filter approximates the speech signal 
in some sense. A large portion of the recent research effort has been directed to 
the design of appropriate excitation functions rather than to investigate alternatives 
to the linear filter form However, the above paradigm for speech coding may 
now be approaching a stage of saturation as far as improvements in terms of 
performance parameters are concerned Further gains m speech coding are likely to 
accrue by incorporating deeper physiological aspects of the human speech production 
mechanism and characteristics of the speech signal in the coder structures Towards 
this end, we have done a dynamical analysis study of speech signals and explored 
some nonlinear representational forms for predictive coding of speech 

The thesis documents our investigation of a nonlinear framework for speech 
signal coding The complete study can be classified as an investigation of three 
related problems The first problem is to choose a sufficiently general framework 
for nonlinear speech processing Specifically, we use a deterministic state space 
framework The choice of a deterministic framework rather than a stochastic one is 
because of our interest in modelling the time waveform itself instead of its statistical 
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moments The motivation for using a deterministic state space framework is due 
to the recent advances m the understanding and characterization of the complex 
behaviour of deterministic chaos in dynamical systems Viewing complex time series 
behaviour as ansing out of low dimensional chaos gives a new tool tor analysing 
and modellmg it deterministically In this framework, the speech time series is 
embedded m a reconstructed state space as a reconstructed trajectory We have done 
a detailed analysis of the reconstructed trajectories of unit articulations ot speech, 
namely phonemes, m terms of dynamical attributes such as dimension, metric entropy 
and Lyapunov exponents Just as a correlation analysis helps in a linear modelling 
exercise, these dynamical attributes help in building nonlinear, determinishc state 
space models As the second problem, we study and compare with linear prediction, 
the performance of some adhoc nonlinear state space based predictive models tor 
speech We have also implemented and carried out preliminary performance tests 
of a local state prediction based low to medium delay speech coding scheme in the 
medium bit rate range The third problem addresses a related question ot estimating 
the minimum rate at which information about a source can be transmitted to the 
user subject to the condition that it can be reproduced with a specified distortion 
We give an algorithm to compute a lower bound of the rate distortion function 
for stationary ergodic sources with memory Both discrete and continuous alphabet 
sources are considered. Finally, we use this algorithm to compute the lower bound 
for quantized speech sources. 

In the following, we give a chapterwise summary of the thesis Chapter 1 begins 
with a contextual review of speech coding Thereafter, we build a case tor the study 
of nonlinear analysis and modelling of speech in terms of (a) observahons from the 
speech production mechanism, (b) observations from the speech signal, (c) limitations 
of a linear model, and (d) advances in nonlinear analysis and modelling techniques 
We give a qualitative discussion of the notions of randomness, determinateness and 
predictability in determmistic dynamics, particularly in the light ot chaos theory 
A brief discussion of the three problems in the thesis is given next These are 
(i) nonlinear dynamical analysis of speech signals, (ii) state space predictive modelling 
of speech, and (iii) computation of a lower bound of the rate distortion function tor 
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stationary ergodic sources with memory We also record m a historical note, the 
recent investigations using tools from nonlinear dynamics for speech signal analysis 
and studies m nonhnear predichve modelling and coding of speech. 

In chapters 2 and 3, we are concerned with nonlinear dynamical analysis study of 
speech signals Chapter 2 begins with a discussion of the theorems that form the basis 
for dynamical analysis of time series data These theorems give generic conditions 
for reconstructing a state space trajectory from a scalar observable of a dynamical 
system evolution such that the dynamical invariants obtained from the reconstructed 
trajectory will be the same as those of the original dynamical system We discuss two 
methods for optimal state space reconstruction based on singular value decomposition 
and redundancy criteria and use them to reconstruct speech trajectories and make 
observations We also give results of the computation of the largest Lyapunov 
exponent of reconstructed trajectories of phoneme articulations Lyapunov exponents 
give a coordinate independent measure of the local stability properties of a state 
space trajectory They asymptotically categorize bounded trajectories into equilibrium 
points, periodic solutions, quasipenodic solutions and chaos We compare the 
largest Lyapunov exponent results for speech with those obtained from synthetically 
generated periodic and quasipenodic data From the results and comparison tests, 
we conclude that reconstructed speech trajectories exhibit exponential divergence on 
the average 

In chapter 3, we give results of the computation of two dynamical invariants, 
namely the correlation dimension and second order dynamical entropy of speech 
The notion of dimension in dynamical systems is associated with the number ot 
degrees of freedom that a system possesses The "dimension" attribute of a time 
senes is helpful in a deterministic state space modelling exercise because it gives 
the necessary and sufficient number of independent state space variables needed to 
model the data A large dimensionality means that the trajectory is "complex" and 
has numerous degrees of freedom m which case a random process model may be a 
better choice A study of various phoneme categories shows that speech is largely 
a low dimensional signal We have also computed the correlation dimension from 
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a simplified statistical model of a particular vowel utterance The dimension results 
are qualified with a study of the various sources of error affecting the estimates 

The second dynamical invariant in which we are interested in chapter 3 is the 
metric entropy which quantifies the rate of loss of information about the initial state 
of a dynamical system as it evolves in time Its relevance in the context ot state 
space modelling is because it is inversely proportional to the average time duration 
for which a dynamical system (or a time series model) can be predicted from a gnen 
mitial condition. We have computed the second order dynamical entropy ot speech 
which is a lower bound of the metric entropy The positive values of the second order 
entropy and the largest Lyapunov exponent (chapter 2) for phoneme articulations 
both give evidence of the average divergence of nearby speech tra)ectones ot speech 
in the reconstructed state space 

Based on the dynamical analysis results, we have investigated some nonlinear 
prediction schemes for speech signal modelling and coding in a state space tramewurk 
m chapters 4 and 5 Chapter 4 begins with a review of the salient features ot the 
analysis - by - synthesis class of hnear prediction coders and in particular the CELP 
coding scheme We give some model based analysis results which make a case 
for nonlinear modelling of speech Thereafter, we study the performance of some 
nonhnear representation forms for predictive modelling of speech in a state space 
framework There are two basic schemes in this framework These are the global 
and local prediction schemes. In a global prediction scheme which we study in 
this chapter, the function parameters are optimized over the entire state space As 
a first choice, we investigate the (quadrahc) polynomial representation form The 
basis of comparison with short term Linear Prediction (LP) in terms of segmental 
predichon gain, is the number of coefficients in the two predictor models We have 
principally considered two ordering schemes for selection of model terms of the 
quadratic predictor In the first method, we exhaust all possible terms upto a certain 
time lag before considering terms which include signal dependence for greater lags 
The second method is based on orthogonal term selection from a set of candidate 
terms While the first method does not perform as well as a short term LP in terms of 
segmental prediction gain, the second method gives a modest improvement over LP 
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for the same number of model coefficients. In another study on quadratic predictors 
leported recently, the basis of comparison with LP is the time delay upto which 
signal correlations are considered rather than the number of model coefficients In 
this case, the performance of the former is significantly better in terms of segmental 
prediction gam 

In chapter 5, we mvestigate a Local State Prediction (LSP) scheme for speech 
In this scheme, the representation form is optimized over a local volume in state 
space where the predichon is to be done The LSP scheme is studied in terms of 
segmental prediction gam, plots of the prediction error sequence, their spectrum 
and autocorrelation function and is compared with the error sequence resulhng from 
(i) short term LR and (ii) short term plus long term LP In LSP an appropriately chosen 
neighbourhood of a "target" point m the reconstructed space will contain trajectory 
points that are close to it m time as well as those which are approximately an 
integral number of pitch or formant periods away Thus, a LSP attempts to simulate 
the funchons of both short term and long term linear prediction simultaneously 
Note that m the LSP scheme studied by us, the local neighbourhood is chosen from 
an analysis frame length of previous data values. The performance of a local linear 
predichon scheme in the above terms can be broadly categorized as lying between 
short term and short term plus long term LP (where both the predictions are done 
m forward adaphve mode) 

We have done a preliminary study of a framework for low to medium delay 
speech coding m the medium bit rate range based on LSP It is an analysis - by 
- synthesis coder operationally similar to CELP and tentahvely named as a Vector 
Excited Local State Prediction (VELSP) coder. The followmg points highlight the 
coding scheme and bring out the differences with CELP 

(i) LSP IS performed mstead of LP The LSP is performed using previous reproduced 
speech which is available to the decoder as well 

(u) A single excitation codebook designed from empirical data is used instead of 
two separate codebooks as in CELP taking advantage of the prediction property 
of LSP 
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(m) A LSP based coder is naturally suited to low delay coding 

(iv) Since a LSP based coder is nonlinear, we give a method to incorporate the gam 

factor 

We have implemented and studied a basic form of the coder structure at rates 
of 5 2, 6 5 and 80 Kbits/s having theoretical coding delay ot 7 5ms, hOms and 
4 875ms respectively While the segmental SNR performance is similar to C LI P, the 
reproduced speech has perceptible noise and becomes poor at the 5 2 Kbitws rate The 
preliminary investigation suggests that the VELSP coding scheme ean be a candidate 
for a more detailed study in terms of (a) improving the quality ot reproduced speech 
by incorporating perceptual distortion measure, adaptive posttiltermg depending 
on a detailed study of the nature of reproduction error etc, and (li) reducing the 
computational complexity of the coder and decoder 

In chapter 6, we give an application of the computation ot a lower bound, 
R’^{D), of the rate distortion function, R{D), for stationary ergodic sources with 
memory The relevance of the rate distortion function R{D) in the context ot signal 
compression is that it gives the minimum rate R at which information about a 
source can be transmitted subject to the constraint that it can be reproduced with an 
average distortion D Specifically, R^{D) — Ri{D) H - //(A) tor discrete alphabet 
sources and R^{D) — Ri{D)+h — h{X) for continuous alphabet sources, where RiiD) 
IS the first order rate distortion function, H{X){h{X)} is the entropy {ditterential 
entropy} of the source X based on the marginal probability density ot the source 
X and H{h} is the entropy rate {differential entropy rate} of the source X The 
estimation of the rates H{h} becomes difficult as statistical dependencies tor larger 
time frames are successively considered We give procedures to estimate the seconc 
order entropy rate H 2 (i ?2 < H) and the second order differential entropy rat< 
^2 (h .2 < h) using a method of generalized correlation sum which is conjectured t( 
give better estimates than the histogram technique The procedures are based 01 
extensions of a method to estimate the metric entropy that has become standard 1 
dynamical systems literature in the last ten years We give examples to show tb 
efficacy of this estimation scheme We also compute the lower bound ot R{D) wil 
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respect to the mean square error distortion criterion for quantized speech sources of 
resolution 6, 8 and 10 bits/sample 

Finally, we conclude in chapter 7 by summarizing the contributions of the thesis 
and nohng some directions for further investigations in these areas 
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Chapter 1 

Introduction 


This thesis documents our investigation of a nonlinear framework for speech signal 
compression The complete study can be classified as an mvestigation of three related 
problems The first problem is to choose a sufficiently general framework for nonlinear 
speech processing Specifically, we use a deterministic state space framework The 
speech time series is embedded m a reconstructed state space as a reconstructed trajectory 
We have done a detailed analysis of the reconstructed trajectories of unit speech 
utterances, namely phonemes, m terms of dynamical attributes such as dimension, 
metric entropy and Lyapunov exponents Just as a correlation analysis helps in 
a linear modelling exercise, these dynamical attributes help in building nonlinear, 
deterministic state space models As the second problem, we study and compare 
with linear prediction, the performance of some adhoc nonlmear state space models 
for speech We have also proposed and carried out preliminary performance tests 
of a local state prediction based low to medium delay speech coding scheme in the 
4 8-8 kb/s range The third problem addresses a related question of estimahng the 
minimum rate at which information about a source can be transmitted to the user 
subject to the condition that it can be reproduced with a specified average distortion 
We give an algorithm for the computation of a lower bound of the rate distortion 
function for stationary ergodic sources with memory Both discrete and continuous 
alphabet sources are considered Finally, we use this algorithm to compute the lower 
bound for quantized speech sources 
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1 Introduction 


1.1 A Brief Contextual Review of Speech Coding 

Speech signal compression has been an achve field of research tor over a couple ot 
decades and conhnues to be so A glance at the recent proceedings of the International 
Conference on Acoushcs, Speech and Signal Processing (ICASSP) and other relevant 
journals would immediately testify to this fact The primary utility ot speech 
compression algorithms lies in speech source coding for distance communication 
and signal storage Figure 1 1 shows a general block diagram representation of a 
communication system that subsumes these two functions ot a speech compression 
algorithm The function of the source encoder is to minimize the necessary bit rate 
for faithfully reproducing the source signal This is done by removing redundancy 
in the signal and thereby compressing it The channel encoder seeks to introduce 
redundancy to the source encoder output for the purpose ot error protection The 



Fig. 1.1: Block diagram of a communication system 
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channel and source decoders perform the inverse operations of the respective en- 
coders It IS often difficult to establish a firm line of demarcation between the 
successive blocks m the diagram In a speech communication system, the barrier 
between the source and the source encoder can be at the terminal of a microphone 
into which a person is speaking in which case it is an analog electrical signal, but 
more often in a digital environment, the input to the source encoder is considered 
to be a prefiltered and finely quantized version of the analog signal. Furthermore, 
to increase the overall efficiency of a commumcahon system, the functions of the 
source and channel encoder are somehmes integrated together. 

Signal compression has remained a frontier research area in spite of the increas- 
ing proliferahon of optical transmission media of relahvely unlimited bandwidth 
This is because of the contmued and, in fact, increasing use of bandlimited me- 
dia such as satellite Imes and radio channels and bit limited storage media such 
as CD-ROMs Also, the applications of speech coding have become numerous in 
recent times To quote a spoof ([52], pp 11), "Resources such as bandwidth obey 
a corollary to Parkinson's Law Resource use will expand to meet the resources 
available " Some applications to benefit from efficient speech coding algorithms are 
mobile satellite communicahons, cellular radio, audio for videophones and video- 
conferencing, universal cordless telephones, interachve PC software, voice message 
broadcashng etc 

Given the prohferation of speech compression algorithms, and even otherwise, it 
IS necessary to judge their performance and compare them with exishng algorithms 
This is broadly done m terms of four parameters bit rate, decoded signal quality, 
complexity of implementahon and commumcahon delay [74] 

(1) Bit Rate: This is usually measured in bita'sample or bits/second which is the 
product of the sampling rate and the number of bits/sample The sampling rate is 
usually slightly higher than twice the signal bandwidth We are interested in the 
telephony grade of audio bandwidth where the frequency band is restricted from 
200 Hz to 3400 Hz thus requiring a bandwidth of 3 2 kHz, and the sampling rate is 
8 kHz 
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(2) Decoded Signal Quality: A lossy compression scheme is most likely to cause a 
degradation in the reconstructed or decoded signal in terms of perceived quality 
In the case of speech, the perceived quality is usually measured on a five point 
subjective scale known as the mean opinion score or mos scale [75], [79], [80] The five 
points of quality are associated with a set of standardized descriptions bad, poor, 
fair, good and excellent An average over many speakers, listeners and speech signal 
segments evaluates the quality. 

(3) Complexity: It is mainly the computational effort required to implement the 
encoding and decoding processes and is usually measured in terms of arithmetic 
operations and memory requirement Other aspects of coding complexity are the 
physical size of the encoder and decoder, their cost and power consumption 

(4) Communication Delay: This refers to the total delay for one - way communication 
1 e coding plus decoding delay Low bit rate coding is typically associated with 
mcreased complexity and processing delays in the encoder and decoder While 
communication delay is largely irrelevant for applications such as broadcasting and 
storage and message forwarding, it is an important constraint in other applications 
like network telephony where the delay requirement can be as low as an order of 
few milhseconds 

Apart from these primary parameters of performance, there exist other practical 
considerations before a signal compression algorithm can be considered for imple- 
mentation in a source coding scheme One such consideration is the degradation 
in performance of an encoder decoder pair in the presence of typical channel noise 
conditions under which it is expected to perform 

Overall speech coding systems are often referred to as toll quality, network 
quahty, vocoder quality etc Toll cjuality refers to high quality reproduction, low delay 
coders and is usually associated with high rate coders Network quality is used to 
refer to those coders which apart from being toll quality, are capable of performing 
addihonal functions such as multiple stages of encoding and decoding of speech 
and high accuracy transmission of non - speech voice band signals such as modem 
aveforms and network signalling tones Vocoder quality coders are those which 
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maintain high level of intelligibility in the reproduced speech, but speech quality, 
naturalness and speaker recognizability are all severely compromised 

The development of speech compression technology has been supported with 
the promulgahon of coding standards The premier regulatory body in charge of 
developing speech coding standards is the Telecommunication Standardization Sector 
of the International Telecommunications Union, referred to as the (ITU-T) which is 
the successor of the International Telegraph and Telephone Consultative Committee 
(CCITT) For a discussion on the methodology of standards development followed 
by the CCITT, see for example [9], [71] The tradihonal bit rate for network quality 
telephony was 64 kb/s This was standardized as the /x-law and A-law Pulse Code 
Modulation (PCM) [75], [114], in the late 1960's and amended in 1972 Subsequently, 
CCITT standardized a 32 kb/s Adaptive Differential Pulse Code Modulation (ADPCM) 
[75], [114], scheme in 1984 and revised it in 1986 [11] An 'almost' network quality 
telephony grade encoding scheme at 16 kb/s was standardized by the CCITT in 
1992 This speech coder is based on a Low Delay - Code Excited Linear Prediction 
(LD-CELP) scheme [29] At the other end of the coding standards' spectrum is the 
U S Government standard 2 4 kb/s LPC-10 vocoder [145]. This is mainly used in 
government and defence communications which require digital encryption over a 
wide range of transmission media Between the two limits of 2 4 kb/s and 16 kb/s 
are a wide variety of local standards due to various organizations Some examples 
are the U S Government Federal standard (FS1016) of 4 8 kb/s CELP based coder 
[23], the Japanese digital cellular radio standard of 6 7 kb/s based on Vector Sum 
Excited LP (VSELP) [53], the North American digital cellular radio standard (1S54) 
of 8 kb/s VSELP and the Pan - European digital cellular radio standard (GSM) 
of 13 kb/s based on Regular Pulse Excitation with Long Term Predictor (RPE-LTP) 
coder [146] All these coding schemes are based on analysis - by - synthesis linear 
prediction coding technique [86], [87] Their performance also lies between vocoder 
and network quality in that they reproduce high quality speech but involve a fairly 
large coding delay of 50 to 60 ms 

A major effort in the last 10 years has been given to the development of analysis 
-by - synthesis speech coding for bit rates between 4 8 kb/s and 16 kb/s Coders 
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designed for this range are usually referred to as "medium bit rate" speech coders 
[68] Present speech coding research activity is concentrated on the design of speech 
coders in the 2 4 - 4 8 kb/s range which reproduce natural sounding speech Coders 
in this range are usually referred to as "low bit rate" coders Design of medium 
bit rate speech coders with specific properties such as low coding delay is also an 
active field of research For recent reviews on advances and directions in speech 
compression research, see [74], [51] 

Most medium and low bit rate speech coders are based on the speech production 
model consisting of a time varying linear filter excited by a source The coders are 
designed to estimate the linear filter coefficients and the excitation sequence in a 
frame - by - frame manner such that the output of the filter approximates the speech 
signal in some sense This model is a gross simplification of the speech produchon 
mechanism of the human vocal apparatus (see, for example [38], [42], [114] for a 
detailed discussion) Let us consider this briefly The human vocal system consists 
of the vocal tract which begins at the openmg between the vocal chords or glottis and 
ends at the lips The total length of the vocal tract in an average male is about 17 
cm The nasal tract is connected to the vocal tract by the velum and ends at the 
nostrils When the velum is lowered, the nasal tract is acoustically coupled to the 
vocal tract to produce the nasal sounds of speech The sub-glottal system consists ot 
the lungs, bronchi and trachea, which serve as a source of energy for the production 
of speech 

Speech sounds are classified into three distinct classes according to the mode of 
excitation. Voiced sounds are produced by forcing air through the glottis with the 
tension of the vocal chords adjusted so that they vibrate in a relaxation oscillation, 
thereby producmg periodic pulses of air which excite the vocal tract This fundamen- 
tal frequency of vibration of the vocal chords is called the pitch frequency Examples 
of voiced sound include /i/ as in bead, /a/ as in ask, /w/ as in we etc Fricatives 
or unvoiced sounds are generated by forming a constriction at some point in the 
vocal tract (usually towards the mouth end) and forcing air through the constriction 
at a high enough velocity to produce turbulence This creates a broad spectrum 
noise source to excite the vocal tract. Examples include /j/ as in ship, /s/ as in sap 
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etc. Plosive sounds result from making a complete closure (agam, usually towards 
the front of the vocal tract), building up pressure behind the closure and abruptly 
releasmg it This rapid release of pressure causes a transient excitation Examples 
of plosives are ft/ as in fie, /p/ as in pie etc. As air travels down the vocal tract, 
the spectrum of the sound is shaped by the resonance frequencies of the vocal tract 
tube which are called its formant frequencies or simply formants. 

The speech production model of a source exciting a time varying linear filter 
IS based on this idealized description of the production mechanism The vocal 
tract (and radiation effects at the Ups) are accounted for by the time varying linear 
filter Its purpose is to model the resonance effects of the vocal tract tube The 
excitation generator creates a signal that models the glottal wave Linear prediction 
(LP) analysis [114], [96], [3], [99], is used to determine an all-pole model of the filter 
which IS usually block adaptive. The problem of designing an optimal linear predictor 
in the mean squared sense is formally equivalent to the problem of estimating the 
coefficients of an autoregressive model from time series using the minimum mean 
squared error (m m s.e ) criterion and which has a rich literature The all-pole filter 
model IS a reasonably good representation of the vocal tract effects However, nasal 
sounds and fricatives require both poles and zeros for effective modelling This is 
taken care of by representing the effect of a zero by including more poles in the 
transfer function [3] 

The major difference between the various coders based on the production model 
lies in the determination of the excitation signal In vocoders, the excitation signal is 
either a random noise sequence or a tram of pulses whose periodicity is determined 
by a pitch frequency analysis In the analysis - by - synthesis class of speech 
coders, the excitation signal is determined in a more complex manner Indeed, the 
difference in the various coders in this class lies primarily in the method of generation 
of the excitation signal In Regular Pulse excited Linear Prediction Coding (RPLPC), 
the pulses in a frame length are equally spaced and their positions are completely 
specified by the position of the first pulse [87], [88] In Multipulse excited LPC 
(MPLPC), the locations and amplitudes of a fixed number of pulses in a frame are 
determined [87], [4] In Code Excited LP (CELP) coding, the excitation sequence 
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(or vector) is determined from gam and shape codebooks [87], [123] Vector Sum 
Excited LP (VSELP) coding is a special form of CELP coding, which reduces the 
search complexity of an excitation vector from a codebook [53] 

The time varying linear filter models the short term correlation structure or the 
spectral envelope of the speech signal The excitation sequence may be explicitly 
determined as m the above coding schemes or it may be derived from a linear filter 
which models the long term correlation or the spectral fine structure ot the speech 
signal [86], [87] This filter attempts to exploit the pitch period redundancy in the 
excitation signal and is particularly effective in the case of voiced speech In the case 
of CELP coders, the pitch redundancy in the excitation signal is either modelled by 
an excitation vector from an adaptive codebook [81], [30], or by explicitly using a 
pitch prediction filter 

The above discussion shows that linear prediction is a well entrenched concept 
in the design of speech coders The structure of these coders based on the idealized 
speech production model has been a very successful paradigm for speech coding It 
can be further gauged from the fact that most present day speech coding research 
efforts are concentrated on fine tuning this model structure Before we develop a 
framework for nonlmear analysis and modelling of speech for coding applications it 
is necessary to consider the arguments for such a study in the first place We do so, 
on a rather qualitative basis, in the following section 

1.2 A Case for the Study of Nonlinear Analysis and 
Modelling of Speech 

There are various indications that suggest the time is ripe tor bringing in nonlinear 
tools for analysis and modelling of human speech We consider these in the following 
points. 

(a) Observations from the speech production mechanism 

In the idealized speech production model, the time varying linear filter models the 
vocal tract characteristics and the source models the glottal excitation A detailed 
model of the vocal tract must consider the time variation of the vocal tract shape, 
losses due to heat conduction and viscous friction at the vocal tract walls, softness 
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of the vocal tract walls, radiation of sound at the lips and nasal coupling The hme 
varying nature of the linear filter attempts to incorporate the nonstationary nature 
of the speech signal. The resonances of the vocal tract are reasonably well modelled 
by the poles of the linear filter Also it can be shown that the bandwidths of the 
lowest formant frequencies (the first and the second) depend largely on the vocal 
tract wall loss and the bandwidths of the higher formants depend primarily on the 
viscous friction and thermal losses m the vocal tract and the radiation loss [114] 
However, it will be appreciated that the effects of some of the above factors in the 
vocal tract are largely nonhnear and a hnear filter can hardly be expected to do full 
justice 

There are several nonlinearities involved in the vibration of the vocal folds and 
the generation of the glottal wave 

• Strong restoring forces act at the collision of the vocal folds 

• Dunng unvoiced sound utterances, the air flow from the lungs becomes turbulent 
as it passes through a constriction in the vocal tract [114], [42] 

• For many sounds, the vocal tract and glottal source do not interact greatly 
and changes in the vocal tract configuration do not greatly influence vocal fold 
vibration However, this is not so in the case of voiced sounds. Experimental 
observations show that at or near the frequency of the first formant there exists 
a nonlinear coupling between the source and vocal tract [42] Further, it has 
been shown that the celebrated two mass model of vocal fold vibrations [73] 
exhibits bifurcations and chaos [67] 

• A relatively recent study compares the performance of the output (model of the 
glottal waveform) of a linear model with a nonlinear model when excited by a 
train of pulses [120] It is shown that only a nonlinear model can accurately 
model the dependence of the output on the amplitude and frequency of the 
driving function. 

(b) Observations from the speech signal 

It IS well known that the speech signal is not ideally modelled by a Gaussian density 
function [75], [114] Thus, a linear prediction scheme to estimate the present speech 
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value as a function of previous values of the speech waveform is not optimal in the 
mean squared sense Further, there has been no systematic study ot the higher order 
statistics of speech There are various practical problems associated with such an 
effort However, an experimental investigation shows that the third and fourth order 
distributions of the speech signal can at best be considered as mixtures ot Gaussian 
distributions where the mixing is controlled by the past of the process rather than a 
single multivanate Gaussian distribution [48] 

Studies of the time waveform or state space plot (Chapter 2) of sustained ut- 
terances of vowels show that they are nearly penodic, although small perturbations 
are always present even in successive periods Tracking this small variability is of 
importance for reproducing the naturalness of speech There are large deviations 
from periodicity in both normal and pathological voice as well These perturbations 
are variously termed as hoarseness, harshness, raucous voice, husky or creaky voice 
etc. and are characterized by the sudden appearance ot subharmonics when the fun- 
damental or pitch frequency suddenly drops to one-half or one-third the preceding 
frequency Such transitions can be regarded as manifestations of bifurcations of the 
underlying dynamical system The backward transition from a subharmonic regime 
to a high pitched one (of double the frequency) usually occurs via an episode of 
nonperiodic oscillations which might possibly be related to a chaotic transient For 
a detailed discussion of these perturbations, see [67] and references therein 
(c) Limitations of a linear model 

The observation of speech signal characterishcs also suggests the limitations of a 
linear modelling scheme In speech signal modelling for coding, not all the structure 
IS captured in the model (i e the time varying linear filter) The residual (or the filter 
excitation) carries information about the fine spectral characteristics ot the signal 

Given that one is interested in capturing as much relevant information as possible 
in the model itself, it is pertinent to ask the question as to what are the limitations of 
a linear model? For this, let us consider a time series n = 1,2, to be modelled 
A general finite order model is of the form 




®n— d) — Vn 


(11) 
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wh>ere / is some representahonal function, d is the model order and e„ is the error 
in model fitting at time instant n In stochastic modelling, the observed time series 
Xn is considered to be a realization of a random process X„ whose joint probability 
distribution is possibly known The best possible model (of the random process) will 
reduce e„ to a realization of an independent, identically distributed (i i d ) process 
or 'strict' white noise The aim of stochastic modellmg scheme is not to model the 
time senes (or waveform shape) itself, but to model the stahstical moments In a 
linear stochastic model, one can at best capture the second order moments of the 
time series 

In a deterministic modelling scheme, the attempt is to model the waveform shape 
Itself In this case, the laws governing the time series evolution are supposedly 
known and (as in eq (1 1)) represents the variable that encompasses information 
about all the unknown degrees of freedom at time instant n An autonomous 
linear dynamical system or difference equation (see, for example. Appendix A and 
references therein for a review of dynamical systems terminology) is not capable of 
modelling such interesfang dynamical phenomena as limit cycles, jump phenomena, 
amplitude dependent frequency, chaos etc 
(d) Advances in nonlinear analysis and modelling techniques 

There has been a marked increase m research activity in both stochastic and deter- 
ministic nonlinear time series analysis and modelling since the last decade This is 
partly because the theory of linear time series modelling is now fairly well advanced 
but the ability of Imear models to represent patently nonlinear behaviour is limited 
Also, a major hmderance m the advancement of nonlinear techniques was the limited 
availability of computahonal resources which has been overcome to a large extent in 
recent years. 

In the case of stochastic models, nonlinear extensions include bilinear models, 
threshold autoregressive models, exponential autoregressive models, state dependent 
models (SDM) etc [142], [111], [64]. The estimation of model parameters requires 
knowledge of higher order joint probability distribution / moments of the underlying 
process Consequently, more moment information is sought to be modelled compared 
to the linear case Attempts to generalize nonlinear models led to the proposal of 
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stochashc SDMs [111]/ [64] which include, as special cases, all the linear (AR, MA, 
ARMA) models and nonlinear models named above 

In applications to speech coding for compression, one is more interested m 
deterministic modelling A straightforward method to build nonlmear deterministic 
models is to choose an adhoc representational form and use a minimization criterion 
to fit the given time series to it It would be more desirable to have a systematic 
procedure for model building For this, we draw upon the theory ot dynamical 
systems The bounded, steady state behaviour of dynamical systems can be catego- 
rized into one of the followmg equilibrium point, periodic solution, quasiperiodic 
solution and chaos (Appendix A) It is primarily the development ot a coherent 
mathematical basis for the description of chaos in nonlinear deterministic dynamics 
(for example, see books [63], [34], [125], [103], [117] and review articles [36], [108]) 
in the last one and a half decades that has facilitated a meaningful attempt ot the 
inverse problem of modelling In this framework, we consider the given time senes 
Xn (in our case, the speech signal) to be a scalar projection (or an observable) of an 
evolving dynamical system (i e the vocal tract system) trajectory The modelling 
problem is to find a representative time series x„ using a phenomenological nonlinear 
state space model to approximate x„ The efficacy of the model is determined by 
its predictive capability. Certain observations in the theory of chaotic dynamics can 
be usefully exploited to build nonlinear state space models We will consider these 
observations and subsequently the steps to build deterministic nonlinear state space 
models in the next two sections 

Before we end this section on a case for nonlinear modelling of speech, it is 
imperative that we point out the general shortcomings of a nonlinear modelling 
scheme as well The following points are worth noting 
1. There are infinitely many nonlinear representational forms 

2 There does not exist a global theory of nonlinear modelling and filtering The 
approach is to study specific representational forms and develop the theory tor 
those which give better performance in terms of desirable functions 

3 The superposition principle is not applicable to the nonlinear case 
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4 The stability of a linear model is determmed by the transfer function only 
However, for the nonlinear case, stability depends on the inihal conditions and 
the excitation function as well Unlike a linear model, a nonlinear model may 
be stable in some region of the state space and may not be so in another region 

1.3 Randomness, Determinateness and Predictability 
in Deterministic Dynamics 

Traditionally, the theory of deterministic dynamical systems and the theory of random 
processes have been treated as separate subjects with little or no overlap It used 
to be implicitly assumed that "random" behaviour was due to extreme complexity 
of the underlying system If a system is sufficiently complicated, with a large 
number of irreducible degrees of freedom, then from a practical point of view it 
becomes impossible to model determmistically - it would be simply not feasible to 
make enough measurements, much less simulate the model In this case, a random 
process model allows us to do the best job we can by lumping many degrees of 
freedom into probability distributions involving only a few variables. Due to the 
Ignorance of the neglected degrees of freedom, predictability is limited, but the model 
IS at least tractable 

The first lesson of chaotic dynamics is that randomness does not necessarily 
involve an enormous number of independent degrees of freedom In the presence 
of nonhnearity, only a few mdependent variables are sufficient to generate chaotic 
motion A chaotic time series can pass all "linear" tests of randomness The second 
lesson of chaos is that apparent random behaviour may be deterministic but at 
the same hme determinism does not imply predictability upto infinite time As an 
illustration, consider the bmary left shift map, 

Xn+i = 2xn mod 1, xq € (0, 1) (1 2) 

which IS piecewise hnear on (0,1/2) and (1/2,1). Given the initial condition xq, this 
map appears to be guilelessly determinate This simple difference equation has an 
equally simple analyhc solution 


x,i = 2”xo mod 1 


( 13 ) 
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Let the tnihal condthon x„ e (0,1) be spectfied as a b.nary representation (It r, ,s 
rattonal, then its bmary representahon ts asymptotrcally penod.c However, .t r. ,s 
irrahonal, .ts b.nary representahon never repeats .tself Moreover, .rrahonal numbers 
form a full measure set on the un.t .nterval ) The torward iterates ot eq (12) are 
generated by merely moving the decimal point sequentially to the right, each hme 
droppmg the integer part to the left of the decimal point Almost all trajectones 
of this determmtshc equahon are chaonc Usually the initial condition is known 
with finite accuracy, say b bits. In such a case, iterating eq (1 2) causes informahon 
about the mihal condlhon to be lost after b iterations Thus, although we have a 
determmisnc evolution law and a specified initial condition, exact prediction atter 
some iterahons is not possible due to the Bnite accuracy ot the initial condition Only 
probabiUshc statements can be made about the tra)ectory atter b iterations Coupled 
with the problem of hnite accuracy of imnal conditions, it is the sens, hue dependence 
of trajectones on the mihal condition and folding of trajectories (631, (34), (1251, (36), [108], 
[117] the two feanrres which may be found in smtple nonlinear dynamical systems, 
that give rise to complicated, aperiodic solutions. 

Thus, chaos is a double edged sword On the one hand it tells us that apparent 
random behaviour m time series may possibly be modelled in a compact, deterministic 
dynamical system or difference equation model involving a few degrees ot freedom, 
but on the other hand it cautions us that predictability of the model may be limited 
m time When only a few degrees of freedom are involved we can model the 
short term behaviour deterministically. In such cases, one can make short term 
predichons that are possibly better than those from a random process model Thus, 
apart from capturing regular time series behaviour (periodic and quasipenodic), it 
IS the possibility of modelling complicated aperiodic behaviour through nonlinear 
dynamical models that makes this approach both exciting and relevant for such 
applications as data compression The reader is directed to an interesting article 
on the issue of randomness of determinishc systems, titled "How random is a coin 
toss?" by J. Ford [43] Also relevant in this context are references [37], [41], [83] 
With the onset of a systematic study of chaotic behaviour in dynamical systems, 
several critena have been developed to determine the degree of complexity or 
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randomness of hme series Let us look at them briefly m the perspective ot the 
various interpretations of the concept of randomness through the ages The main 
thesis invariably consists of regarding randomness as the "absence of laws" Several 
viewpoints regarding randomness have now been formulated, most of which are 
qualitatively similar 

The first one is the set theoretical approach on which the modern theory ot 
probability is based In this approach, the concept of randomness is associated 
with the possibility of ascribing to a given quantity a probability measure, i e a 
quantity is said to be random if it is determined by its probability distributions The 
"absence of laws" is reflected here by its spread Determinate quantities correspond 
to distributions described by S functions In a second order analysis, uncorrelatedness 
IS often equated with randomness However, an uncorrelated process need not be 
unpredictable Consider, for example, successive iterahons of the nonlinear map 
Xn+i = 4sn(l - 3^0 ^ (0, 1) The resultmg sequence is a realization of an ergodic 

process whose autocorrelation function < Xn Xn+k > — < Xn >^= 0, unless A: = 0 
Hence, it is uncorrelated, but at the same time predictable In as much as we are 
concerned with predictabihty of models and data compression, the most general 
notion of randomness in this framework is associated with 1 1 d processes which are 
completely unpredictable 

An alternative approach to this concept is the mterpretation of randomness as an 
algorithmic complexity This approach was developed independently by Kolmogorov, 
Chaitin and Solomonov [28] As a measure of complexity of a given binary sequence 
Xn, n = 1,2, ,N, It IS proposed that one take the length of the program L (in bits) 

which generates the sequence If the program is small, then L is significantly 
shorter than the length of the sequence a:„, n = 1,2, , AT, so that it can be regarded 

as nonrandom In the opposite case, when L ~ A/, the program essentially reduces 
to recording the sequence itself symbol by symbol The complexity of the 
corresponding program can serve as the basis for classifying a given sequence as 
random Maximum complexity sequences are so unpredictable and incompressible 
that the term "random" seems appropriate From the viewpoint of complexity, almost 
all sequences x„ are random since simple programs form a set of measure zero, much 
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like rational numbers on the real line There is also the practical problem ot actually 
estimating the complexity of a sequence or time series An additional ditficulty is 
the invariable presence of "noise” which makes even algorithmically simple processes 
algorithmically complex 

In experimental situations various empirical notions of randomness are used 
Some qualitative ones are nonreproducibihty (impossibility of obtaining the same 
realization of a process under identical external conditions), nonrepeatability (which 
can be interpreted both as nonreproducibihty and the absence of periodicity in 
the given process), noncontrollabihty and nonmomtorabthty (impossibility ot creating 
conditions under which the process would proceed in a prescribed manner) On 
a quantitative level are the time average definitions corresponding to statistical 
ensemble based descriptions of randomness The most widely used criteria are 
based on associating the quantification of randomness with the degree ot absence ot 
periodicity. Examples of this are the criteria of "decaying correlation" or that of a 
"flat spectrum" While these criteria can distinguish between periodic and aperiodic 
behaviour m tune senes, they are not sufficient to differentiate between aperiodicity 
(deterministic chaos) and randomness as discussed earlier in this section 

The answer to whether a time series is truly random (in the sense of un- 
predictability) or IS aperiodic but chaotic (and hence predictable) is provided by 
estimating nonlmear dynamical invariants such as dimension (fractal, information, 
correlation dimension etc ), metric entropy and Lyapunov exponents of the underly- 
ing system [37], [41], [40], [125], [61]) The notion of dimension in dynamical systems 
IS associated with the number of degrees of freedom that a system possesses The 
number of state variables needed to describe a dynamical system is known as the 
nominal degrees of freedom While a dynamical system may have many nominal 
degrees of freedom, the trajectories may settle down on or approach a subset of 
the state space called the attractor Attractors exist only for dissipative dynamical 
systems and can be any one of four types, namely, fixed point, limit cycle, q-torus for 
quasipenodicity and strange attractor in the case of chaotic dynamics The dimension 
of these objects on which the steady state dynamics settles down gives the effective 
degrees of freedom. A deterministic state space model of the steady state behaviour 



I Introduction 


17 


of a dynamical system is advantageous over a random process model in terms of 
predictive capability, data compression etc only if the effective degrees of freedom 
IS small, which in turn ensures that a few variables are needed to model the steady 
state trajectory A large dimensionality means that the trajectory is "complex" and 
has numerous degrees of freedom In such a case a random process model may 
perform better 

Metric entropy quantifies the rate of loss of information about the inihal condition 
as the dynamical system evolves The predictability time of a dynamical system is 
proportional to the ratio of the loganthmic precision of the initial condition to 
the metric entropy Lyapunov exponents give a measure of the local stability 
properties of a trajectory If a trajectory evolves in n-dimensional state space, then the 
characterization is done through n exponents, usually arranged m decreasing order 
They categorize bounded trajectones into equilibrium points, periodic solutions, 
quasipenodic solutions and chaotic solution A positive Lyapunov exponent of a 
bounded trajectory indicates that it is chaotic For one dimensional maps, the metric 
entropy is equal to the Lyapunov exponent The binary left shift map of eq (1 2) has 
a metric entropy of one bit per iteration Thus, its predictability time is proportional 
(equal, in this case) to the loganthmic precision of the initial condition, i e b bits 

While the definitions of dynamical invariants refer to the dynamical system in 
question, their estimation from hme series (the observable of the dynamical system 
evoluhon) is made possible by a set of powerful theorems proposed by Takens in 
1980 [133] Effectively, these theorems state that the dynamical invariants eshmated 
from the observable time series by suitably reconstructing a state space trajectory will 
be the same as those of the underlying dynamical system under certain topological 
conditions 

We classify the problems of state space reconstruction and estimation of the three 
dynamical invariants, namely, Lyapunov exponents, dimension and metric entropy 
from unit speech utterances, i e phonemes, as the first problem investigated in 
the thesis Linear analysis such as correlation and PARCOR function evaluation, 
modes of vibration from spectral analysis etc are not much relevant in a nonlinear 
modelhng exercise The estimation of dynamical invariants provides answers to 
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whether speech time series can be modelled by a simple deterministic state space 
model in the first place, and to other related questions 

1.4 Deterministic State Space Modelling 
of Time Series 

In this framework, the given time series n = 1, ,N is regarded as a scalar 
projection (or observable) of an evolving dynamical system trajectory The modelling 
problem mvolves two steps 

1. State space reconstruction: A state vector is an information set which fully de- 
scribes the system at a fixed time instant n If it is known with complete accuracy and 
if the system is strictly deterministic, then the state can contain sufficient information 
to determine the future of the system The goal of state space reconstruction is to 
use the immediate past of the scalar observable x^, to reconstruct a d dimensional 
current state vector x^ 

2. Nonlinear function approximation: A general state space model can be repre- 
sented in the form 


x„ = (146) 

where is a d-dimensional reconstructed trajectory given by eq (1 8) below The 
scalar output Xn is then a generalized projection of x^ The factors determining 
the closeness of the approximations based on some distortion criterion, are the 
representational forms of g and h and the inputs u^ and To model chaotic 
behaviour, the representational form of g must be nonlinear Various nonlinear 
representations are available for the approximation problem Some prominent forms 
are polynomials, rational polynomials, wavelets, neural nets, radial basis functions 
etc 

It IS worth nohng that a linear prediction coding scheme of the form 

d 

x,i = ^ ^ 4" iLfi ( 1 o) 
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where Xn is the output and Un is the input excitation, can be cast in the state space 
form of eq. (1 4) In particular, for a linear time invariant system, eq (1 4) can be 

written in the form 


+ b^Un 

Xn = + cF Vn 


(16a) 

(166) 


For the case of eq (1 5), A, b, c and d are respectively given by 


0 

0 


1 

0 


0 

1 


and x^ IS given by 


0 0 
0 0 


^ = 

0 0 0 

0 1 

(17a) 


,ad-i ad- 2 ad-3 

ai tto. 


6 = [00 01]^ 


(176) 

c = [00 01]^ 


(17c) 

d = 0 


(17d) 




(18) 


Predictive state space modelling can be divided into two categories, namely global 
and local predichon, according to the method by which the function parameters are 
estimated In a global prediction scheme, the function parameters are optimized over 
the vectors x^ over the entire state space, whereas in a local prediction scheme, the 
function g IS ophmized over a local volume in the state space where the predichon 
needs to be performed 

We have investigated the predichon properties of polynomial global and local 
linear state space predictive models for speech signals We study the latter modelling 
scheme in some detail and compare its performance with traditional LPC scheme 
Based on this study we propose a framework for medium bit rate, low to medium 
coding delay speech compression algorithm using local state space predichon This 
constitutes the second investigated problem of the thesis 
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1.5 Fundamental Limit to Signal Compression 

In as much as this is a thesis on a framework for speech signal compression, it is 
relevant to ask the question as to what is the fundamental limit to the coding rate 
below which no further meaningful compression is possible Rate distortion theory 
provides a mathematical foundation to this problem of source coding In general, 
the source - receiver pair (of Fig 1 1, for example) is characterized by a probabilistic 
model and a fidehty criterion to measure the degradahon of the coded output with 
reference to the source The rate distortion function, R{D), gives the minimum rate 
R at which information about a source can be transmitted subject to the constraint 
that it can be reproduced with an average distortion D [12], [17] According to the 
information transmission theorem, it is impossible to obtain an average distortion 
D or less unless R{D) < C, where C is the capacity ot the transmission channel 
For memoryless sources, one can usually compute R{D) analytically if the source 
probability density funchon (pdf) is known It can also be computed numerically 
from source output realizahons using Blahut's algorithm [17], [16] For sources with 
memory, one can capitalize on the inherent statistical dependencies to further reduce 
the minimum rate needed to achieve a specified average distortion However, the 
computation of the rate distortion function for sources with memory is a difficult 
task. Analytically, it is known only for a few joint pdfs such as the joint Gaussian 
density function and for specific distortion criteria [12] 

We consider the computation of a lower bound Rl{D) of the rate distorfion 
function R{D) for stationary ergodic sources with memory Specifically, Rl{D) = 
R\{p) + H - H{X) for discrete alphabet sources and Ri{D) = Ri{D) i- h - h{X) for 
continuous alphabet sources, where Ri{D) is the first order rate distortion function, 
H {h) IS the entropy rate (differenhal entropy rate) and H{X) ( h{X) ) is the entropy 
(differential entropy) based on the marginal probability density of the source X 
Even in the computation of this lower bound, the estimation of the rates H (fi) 
from finite data length N using histogram technique becomes difficult as statistical 
dependencies for larger and larger time frames are successively considered We give 
procedures to estimate the second order entropy rate H 2 { Hi < H ) and the second 
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order differential entropy rate ho {h 2 <h) using a method of generalized correlation 
sum. 

Many p d t models have been suggested for the speech process based on first 
order histograms The Gamma pdf based on the long term statistics, the Laplacian 
pdf based on the medium term statistics and the Gaussian pdf. based on the short 
term statistics are among the more popular ones [75]. An estimation of the first order 
rate distortion functions based on these pdfs and the absolute value and the m s e. 
distortion measures are available in [2] We have numerically computed the lower 
bound of the general rate distortion funchon (i e for sources with memory) with 
respect to the m s e distortion criterion for quantized speech sources of resolution 
6, 8 and 10 bits/sample using our proposed method sketched above 

These inveshgations constitute the third problem studied in this thesis. 

1.6 A Historical Note 

In this section, we attempt to record the recent investigations using tools from 
nonlinear dynamics for speech signal analysis and studies in nonlinear predictive 
modellmg of speech The preliminary proposition that advances in nonlinear dynam- 
ics especially m the development of tools for the analysis of chaos can be utilized in 
speech signal analysis in the context of modelling, coding or compression was made 
pnmarily m Tishby [140], Kumar [90], Kumar and Mullick [92], [91], [93], Maragos 
[98] and Bernhard and Kubin [13] The analysis of speech signal complexity in terms 
of dimension is reported in Tishby [140], Kumar [90], Kumar and Mullick [92], [91], 
Townshend [143], [144], and Bernhard and Kubin [14] Togneri et al [141] show 
that the space of trajectories of speech may be approximated by a four dimensional 
manifold which is nonlinearly embedded both in a space of LPC coefficients and 
in a filter bank space It has been argued and shown by Bandt and Pompe [7] 
that entropy profiles of speech signals provide a fuller description of its structure 
compared to the spectrum This is because entropy profiles are invariant with respect 
to a large class of nonlinear distortions of the signal. 

There have also been some efforts toward the study of nonlinear predictive 
modelling schemes for speech with an eye to coding In the category of global 
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predichve models, neural network based predictive models tor speech have been 
studied and proposed by Tishby [140] and Thyssen et al [139], fractal interpolahon 
models by Maragos [98], polynomial difference equation models by Quatieri and 
Hofstetter [113], Kumar [90], Kumar and Mullick [92], [91], second order Volterra 
alter models by Thyssen et al [139] and recurrent nets by Wu and Niran)an [150] 
In the category of local predictive models in state space are the pattern search 
prediction scheme due to Bogner and Li [18], the codebook prediction schemes of 
Wang et al [148], and Singer et al [129], the local prediction model ot Townshend [144], 
the Compromised Overlapping Neighbourhood Local Approximation technique of 
Kumar [143] and Kumar and Mullick [92], [93] and the nonlinear oscillator model of 
Kubin and Kleqn [89] 

Most of the above investigations in nonlinear prediction schemes tor speech 
record an SNR improvement of 2 to 3 dB over a comparable linear prediction 
scheme Moreover, the prediction residual is reported to be significantly "whiter" 
than the short term LPC residual Some of the local prediction schemes have been 
incorporated in DPCM and ADPCM speech coders and their performance studied 
[143], [89], [51] However, a systematic study of nonlinear predictive speech coding 
m the medium to low bit rate range has not been reported so tar to the best ot our 
knowledge. 

We will refer to the above papers on dynamical analysis and nonlinear predictive 
modelhng of speech in more detail at appropriate places in the thesis 

1.7 Organization of the Thesis 

The organization of the three problems investigated in the thesis is as follows In 
chapters 2 and 3, we discuss about dynamical analysis of speech signals and give 
results of the estimation of dynamical invariants for speech Some schemes tor 
nonlinear predictive modelling and coding of speech are discussed in chapters 4 
and 5 We give an algorithm for numerically computing a lower bound ot the rate 
distortion function for stationary ergodic sources with memory in chapters 6 and 
derive some conclusions in chapter 7 
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Chapter 2 begins with a discussion of the theorems that form the basis tor 
dynamical analysis of hme series data These theorems give generic conditions tor 
reconstructing a state space trajectory from a scalar observable of a dynamical sys- 
tem We next discuss two methods for optimal state space reconstruction based on 
singular value decomposition and redundancy criteria and use them to reconstruct 
speech trajectories and make observations. We also study the local stability prop- 
erties of speech trajectories by estimating the largest Lyapunov exponent from the 
reconstructed trajectones 

In chapter 3, we consider the estimation of two dynamical invariants, namely 
dimension and metric entropy, from time series data. The state space reconstruction 
theorems of chapter 2 allow us to get an estimate of these invariants from a scalar 
observable of the time evolution of a dynamical system We discuss the estimation 
of these invanants using a method of generalized correlation sum The results ot 
the numerical computation of correlation dimension and second order entropy from 
speech time series are also given and discussed 

Chapter 4 is the first of two chapters concerning nonlinear deterministic state 
space modelling of speech We first review the salient features of the analysis - by - 
synthesis class of linear prediction coders and discuss the basic structure and analysis 
steps of a CELP coding scheme Some model based analysis results indicatmg the 
presence of nonlmearities in speech are given We also give the analysis steps and 
results of out experiments on polynomial prediction of speech and its comparison 
with the usual linear prediction and make some observations 

In chapter 5, we study and compare with linear prediction, the properties of 
a state space based prediction scheme for speech called the Local State Prediction 
(LSP) scheme A natural method for incorporating LSP in a speech coder is to use 
it analogous to a backward adaptive scheme We propose a framework for low to 
medium delay speech coding in the medium bit rate range based on LSP This coder 
uses an analysis - by - synthesis scheme and is structurally similar to CELP and 
named as a Vector Excited Local State Prediction (VELSP) coder 

In chapter 6, we propose an algorithm for the computation of a lower bound 
of the general rate distortion function for stationary ergodic sources with memory 
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Both discrete and continuous alphabet sources are considered We give examples of 
the computation ot the lower bound trom random process realizations Although 
speech signal source is time varying, we use the algorithm to get an estimate of 
the lower bound of the rate distortion tunction tor quantized speech sources with 
respect to the mean square error distortion ciitenon 

Finally, we make some concluding remarks in chapter 7 and give some directions 
for further study 



Chapter 2 

Dynamical Analysis of 
Speech Signals — 1 


One of the powerful mathemahcal tools for the analysis of general hme series comes 
from the modern theory of dynamical systems in the form of a set of theorems 
These theorems give a method for reconstructing the evoluhon of a dynamical 
system m state space from the observation of a single degree of freedom We discuss 
the import of these theorems in section 2-1. While these theorems give generic 
condihons for reconstructing the state space trajectory from a scalar observable, they 
leave unanswered the question of how best to estimate the parameters needed for 
reconstruction We will consider two methods for optimal state space reconstruchon 
and use them for reconstructing speech trajectories m sections 2 2 and 2 3 In sechon 
2 4, we discuss a method for estimahng the largest Lyapunov exponent from hme 
senes and use it to study the local stability properhes of speech trajectories 

2.1 A Theory of State Space Reconstruction 

In this section, we will review the theory that forms the mathematical basis for 
state space reconstruction from a scalar hme series Figure 2 1 gives a schematic 
description of state space reconstruchon for the Rossler system in the chaohc regime 
The Rossler system is characterized by the set of equations 
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x = -z-u 
!/ = T + at) 

z-b-\-z{r - ij) (2 1 ) 

Figure 2.1(a) shows the projection of the trajectory on the (r.y) plane tor a = () 15,6 = 
0 2 and c = 10 0 Figure 2 1(b) shows the hme evolution of the scalar observable which 
in this case is the x-variable It is assumed that the scalar observable is a function 
of some (unknown) variable(s) of the dynamical system hguie 2 1(c) shows the 
reconstructed trajectory from the r-vanable in the (x(t), x(t+l s7)) plane There is 
a differentiable equivalence (which is greater than topological equivalence) between 
the objects on which the trajectories settle in fig 2 1(a) and tig 2 1(c) I he set of 
theorems to be presented as Facts below assert that it is possible to embed a scalar 
hme series in the reconstructed state space such that the asymptotic properties of the 
reconstructed dynamics and those of the original dynamical system are the same 
The fact that one can get information about a dynamical system by observing the 
temporal evolution of just one variable has led to the current explosion in research 
achvity in this field Qualitatively speaking, the observed variable is intimately related 
to the other dynamical variables and so its time evolution contains intormation about 
them This would then be the basis for estimating dynamical invariants such as 
Lyapunov exponents, dimensions and metric entropy from block stationary intervals 
of speech signals The theory assures us that they would be the same tor the 
underlying dynamical system, le, the vocal tract system configuration and the 
associated excitation during that utterance interval 

We refer the reader to Appendix A and references therein for a brief review 
of dynamical systems terminology The problem of state space rcLonstruction was 
proposed in [107] and put on a firm mathematical foundation by lakens [131] 

Consider a dynamical system defined by a smooth (at least ( '■) diffeomorphic 
map (a bijection, 1 e 1-1 and onto map) (j> M -+ M (if the system is discrete 
time) or by a vector field g on M, (if it is continuous time) Here, M denotes a 
compact manifold on which the system evolves Let its dimension, dim(M) -m, 
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le, every point on it has a neighbourhood homeomorphic to i?- The flow of the 
system is given by r(so) (discrete time) or (continuous time), where s„ is the 

initial system state Let h M R be a smooth function which induces an observable 
on the dynamical system This produces a scalar time series r„ - /i(o'‘(s„)) (discrete 
time observable) or x(t) = h{<l>{t,so)) (continuous time observable) Given the above, 
one can state the following 

FACT 2.1: Given a point p in M, there is a residual (open and dense) subset Cgp of 
positive real numbers such that for T G Cg.p, the positive limit sets of p tor the flow 
(j){t,so) and for the diffeomorphism <l){T,so) are the same That is, for T € Cg,p, we 
have that each point qeM which is the limit of a sequence ^;(f.,p), L G RJ^ oo, 
IS the limit of a sequence </>(niT,p),n, £ N,n, - ^ -xj ■ 

FACT 2.2: For a smooth diffeomorphic map A/ A/, and a smooth function 
h Af i?/ It IS a generic property that the map A/ dehned by 

^(<^,y)(so) = [Mso) h{4>{SQ)) A(p“'"fso))l^ 

= [aco Sl (2 2) 


is an embedding ® 

FACT 2.3: Consider a vector field g, a smooth function h, a point p in M and 
a positive real number T For generic g and h and T satisfying generic conditions 
depending on g and p, the positive limit set L^(p) is diffeomorphic with the set of 
limit points of the following sequence in 

^gAp,r = [^(^(^^>P)) ^('!^((^ + 1)^’P)) fi(<>((A. + 2r/!)T, p))|,^ ^^0, ,oc 

■ 

From Fact 21 we infer that the limit sets of the continuous time flow ';<(f,so) and 
those of the correspondmg discrete time mapping f/»"'(so) introduced by the process 
of samphng the flow at uniform mtervals T, are the same This is true for most 
choices of sampling hme The values of T, that do not preserve the limit sets are 
those that are commensurate with the system's inherent excitation modes In such 
cases, a small perturbation of the sampling time removes the problem 
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Fact 22 IS of direct relevance to the reconstruction problem It is based on Whit- 
ney's embedding theorem which states that every m-dimensional compact man- 
ifold M embeds in Embedding means that there exists a diffeomorphism 

$ M Here, it is stated that a specific map which uses time delays 

to reconstruct the state space, will work Takens has also proved that one can use 
time denvatives to create embeddings While is the largest space needed for 

reconstructing the dynamics (a sufficiency condition), it is often possible to do so only 
in R'^ (a necessary condition) The space in which the dynamics is reconstructed is 
called the embedding or reconstruction space, and its Euclidean dimension is called the 
embedding dimension 

Fact 2 3 follows as a corollary to Facts 2 1 and 2 2 Fact 2 1 asserts that the limit 
set of a continuous time flow is same as that of a discrete time one obtained by 
uniformly sampling the flow and Fact 2 2 gives a particular method for embedding 
discrete hme maps on manifolds into Fact 2 3 states that a uniformly sampled 

data set embedded into 72^"*+^ has the same limit set as that of the onginal system 
under observation 

Facts 2 1-2 3 provide a theoretical basis for reconstructing the state space from 
a scalar observable of a dynamical system. However, in an experimental situation 
or otherwise, where one is confronted with a time sequence of scalar observations, 
one does not have aprion knowledge of the dimension, m, of the manifold on which 
the origmal system dynamics evolves One way to overcome this problem during 
state space reconstruction is to mcrease the embedding dimension systematically 
until the trajectories in the corresponding embedding space do not seem to intersect 
Another practical problem is the choice of an appropriate hme delay between the 
scalar components of the reconstructed vector Theoretically, one can also choose 
the successively sampled data values as the constituents of the reconstructed vector 
as in eq (2 2), because of the premise that successive scalar measurements contain 
some new information about the dynamics However, experimental considerations 
like noisy measurements etc. require that some ophmality condihon be used in the 
choice of time delay In the next two sections, we review two such optimality criteria 
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from dynamical systems literature, discuss their relative merits and demerits and use 
them to plot state space trajectories ot speech signals on the reconstructed space 

2.2 Optimal State Space Reconstruction using 
SVD Criterion 

The observation process can be represented by 

s = g(s(f)) + A(t) 

x{t) = h{s{t)) + -lit) (2 3) 

where s is a d-dimensional state vector, /i is a scalar valued observation function, A{t) 
IS a d-dimensional vector which denotes additive dynamical noise and '>(t) denotes 
additive observational noise x{t) is a scalar valued continuous observable which may 
be uniformly sampled at intervals of T to produce the observed time series r{t + iT), 
or Xj, t = 1,2, Observational noise does not alter the deterministic state of the 
system whereas dynamical noise does A general time delay reconstruction in R" 
produces a reconstructed vector time series 

x"(t -h iT) = [x(f + iT) x{t + iT + kT) r.it -yiT (n - 

< = [x^x.+it x.^(„_ t = 1,2, (2 4) 

Rigorously speaking, the theory of state space reconstruction as discussed in section 
21 IS valid only for infinite amount of noise free data Otherwise, invariants are 
dependent on the type of reconstruction However, the theory is still useful in the 
low noise limit case In the following, we will assume that there is no dynamical 
noise during system evolution, i e , A(f) = 0, and the only source ot randomness 
IS observational noise Observational noise can arise due to finite precision ot the 
measuring instrument, inaccuracy of measurement due to filtering effects etc 

The process of embedding the dynamics in a reconstructed state space requires 
a choice of 3 parameters sampling time T, embedding dimension n and time 
delay k, as seen from eq (24) Qualitatively speaking, the state of a dynamical 
system contains all the mformation about it at a particular time instant Therefore, 
a good state space reconstruction is one in which the state vector contains the 
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maximum possible information about the dynamics to make its future as predictable 
as possible It is also desirable that the dimensionality of the state vector be as small 
as possible. Various state space reconstruction methods have been proposed based 
on singular value decomposition (SVD) [21], information theorehc criteria [46], [45], 
other statistical [25], geometrical [27] and topological considerahons [94] 

The method of smgular value decomposition (SVD) was proposed by Broomhead 
and King [21] for use m state space reconstruction SVD is used to arrive at an 
appropriate embeddmg dimension n Heuristic arguments are used to compute the 
window length, = nkT, which is the time span of the reconstructed vector (see 
eq (2 4)) From the observed scalar time series Xi, t = 1, ,N, we reconstruct I 

dimensional vectors x|,i = 1, ,N - l+l, where x[ = x^+,_,]^, and N is 

rather large The vector time series is used to form a trajectory matrix X 


1 

II 

,T iT 

Xi X2 

X,V-1+1 ] 


Xi 

X2 X, ■ 

, 1 

^2 

X3 X/4.1 


* 



.X;\f_;+1 
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The SVD of the {N — I + 1) x I matrix X allows us to decompose it m the form 

X = USV^ (2 6) 

where U is a {N ~ l + l) x I dimensional matrix such that its columns are orthogonal, 
V IS a I X I orthogonal matrix such that and S is a I x I diagonal matrix 

whose elements cri, ,cr„ are nonnegative and are called the singular values They 
are, by convention, arranged in decreasmg order, i e , crj > <73 > > cri > 0 If rank 

(X) = n < I, then cr„ > 0, and cTn+i = = cr; = 0 The columns of the matrix U 

comprise of the eigenvectors of XX^, I of which have nonzero eigenvalues The 
eigenvalues and eigenvectors of the matrix X^X are the square of the singular values 
and the columns of the matrix V respechvely Also, the I x I matrix X^X is the 
covariance matnx of the scalar components of x[ averaged over the entire trajectory 

= V E (2 V 
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If we assume that the elements of the matrix X come from some continuous 
distribution, then full rank matrices comprise a set ot maximal measure [101] Thus, 
one cannot expect X to be less than full rank The presence of noise in the time 
series causes a saturation m the singular value spectrum It, for example, a scalar 
time series x^ consists of a deterministic component i:, (= h{^{t,)), see eq (2 3)) and an 
additive white noise component i e , r, = x, + i -= 1, , N, then asymptotically 

[ 21 ] 

cr/ <^- >, I = 1, ,l (2 8) 

where is an eigenvalue of X'^X and a; is an eigenvalue ot X^'X, X being the 
trajectory matrix corresponding to the deterministic time senes r. The singular 
spectrum a, saturates to a noise floor value when ^ •> is significantly greater 

than This defines a threshold on the magnitude ot the singular values such that 
those values less than the threshold correspond to predominantly noise coordinates 
The embedding dimension n is taken as the effective rank ot the trajectory matrix 
corresponding to the number of singular values above the threshold, since this is 
the subspace spanned by the deterministic part of the time series 

The choice of window length, = nkT, where n is the embedding dimension, k is 
the time delay and T is the sampling time, is based on heuristic arguments In order 
to exclude more than one integer data period within a window length, it is sufficient 
that, for bandlimited data, Ty, < j, where fi is the bandlimiting frequency A lower 
bound on Ty, is given by Ty, > (2m+l)T, where rn is the dimension of the manifold 
on which the dynamics evolves However, since m is not known aprwri, one can 
use an estimate Ty, = j This gives an estimate of the product AT which is the time 
delay, in seconds, between successive scalar constituents ot the reconstructed vector 
in eq (24) Usually, the sampling time T is fixed by the measuring instrument, in 
which case, an mteger estimate of the time delay is A - [T,o/nT| where [ ] operation 
denotes the choice of the largest integer k such that nkT < T,„ 

The SVD method was one of the first proposed to reconstruct state vectors 
from experimental time series using certain optimality criterion However, there 
are limitations with this method The principal idea here is to determine those 
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coordinate directions (the singular vectors) along which the embedded trajectory has 
significant spread (given by the singular values) Recall that the singular values 
and vectors of X can be computed from an eigenanalysis of the covariance matrix 
X'^X This means that the orthogonality of the singular vectors translates to linear 
independence or uncorrelatedness in the limit of large apriori embedding dimension 
I Thus, two scalar time senes having the same covariance structure but different 
higher order moments cannot be distinguished by an SVD analysis Moreover if 
the scalar time senes x, is an observable of a chaotic dynamical system, then the 
embeddmg dimension n is equal to the apriori embedding dimension I, however 
large I is This is because in the limit of large I, the coordinates m SVD become the 
Fourier coefficients and in the case of chaotic dynamics, all the Fourier coefficients 
are nonvanishing and contain information about the dynamics It is suggested in 
[101] that the singular values be estimated directly from X rather than from the 
eigenvalues of the X^X matrix which may lead to possible numerical artifacts m the 
form of apparent convergence of the singular values due to machme precision 

The efficacy of the SVD criterion m state space reconstruction lies in noise 
reduction If one knows the noise level or the accuracy of the data apriori, then it 
IS prudent to discard that subspace of the trajectory matrix which corresponds to 
singular values below the noise threshold This method is also useful in plotting 2-d 
projections of state space plots because ideally it is desirable to project the embedded 
trajectory onto that subspace in which it has maximal spread 

We have studied this reconstruction scheme on unit speech utterances, namely 
phonemes of database 1 (Appendix B) Die SVD analysis was performed on 44 
consonants of the International Phonetic Alphabet (IPA) spoken by 4 trained persons 
(3 males and 1 female) and 8 cardinal vowels spoken 4 times by a single person (Daniel 
Jones) We excluded the 13 plosives from all analysis reported in the thesis because 
of their extremely short duration of utterance Thus, a total of 208 (44 x 4 + 8 x 4) 
phoneme utterances were used for all analysis work 

To form the X matrix for SVD analysis, we choose a window length T,u =05 
ms which allows us to use an apriori embedding dimension I = 9 at time delay 
^ = 1 Since the highest significant frequency content of speech may be greater than 
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— 2 Icpiz, it IS gcncncsUy possible th3t the reconstructions 3re bsd Choosing 
3 ccording to the argument given earlier will not allow us to retain a sufficiently 
large I This can be seen as follows - assuming the bandhmitmg frequency of speech 
to be /i = 4 kHz, we will have only 5 samples in the window at 16 kHz sampling rate 
Thus, the largest allowable value of I is equal to 5 in this case Figure 2 2(a) shows 
the normalized smgular value spectrum in log scale (log[o(<T,/ o’,), i 1, ^9) 

of the trajectory matrix X of two time senes corresponding to cardinal vowel /a/ 
and approximant /]/ Figure 2 2(b) shows the first three singular vectors of X 
correspondmg to /a/ These two graphs are typical of the SVD analysis done on 
all the phonemes of the database The embedding dimension n is chosen to be 
the number of singular values above the saturation floor and it varies from 3 to 
6 No significant difference in the mean value ot n is observed across the broad 
phoneme classes of nasal, trill, tap or flap, fricative, lateral fricative, approximant, 
lateral approximant and cardinal vowel It is noteworthy that the mean threshold 
value across the 22 fricatives is a factor ot 3 2 higher than that across the 8 cardinal 
vowels We mterpret this as the inability ot the SVD scheme to sitt the "more 
complex" deterministic part of the fricative trajectories from the noise component 
compared to the cardinal vowel case In the following section, we will discuss a 
reconstruction scheme based on the more general notion of independence rather 
than that of linear mdependence of the SVD scheme 

In figs 2 3-2 7, parts (a)-(d) each, we plot for five phoneme utterances, the 
respecbve time series, founer spectrum and the reconstructed state space trajectories 
usmg the SVD and mutual information criteria The five phoneme utterances 
corresponding to figs 2 3-2 7 are cardinal vowels /i/, /o/ and /a/, fricative /?/ and 
approximant /j/ respechvely In part (a) of each figure, 400 samples of the time 
series are plotted At 16 kHz sampling rate, this is equivalent to 25 ms of speech 
utterance In part (b), we plot the first 100 points of the corresponding 400 point 
founer spectrum The 2-d projections of the reconstructed trajectory onto the 1-2 
plane spanned by the first two singular values are shown in part (c) of figs 2 3-27 
We will comment on the reconstructed trajectories in part (d) of the respective figures 
m the next section One can make the following observations from these figures 



Dynamical Analysis of Speech Signals — 1 


35 




1 ^ ^ ^ ^ ^ ^ ^ ^ 1 

0 3 6 9 


Coordinate Index 

(b) 


(a) The normalized spectrum of singular values for (1) cardinal vowel /a/ and 
(2) approximant /v/ Data length N = 4000 and 1750 respectively Sampling rate 
1/T = 16kHz at 12 bits/sample (b) The first 3 singular vectors corresponding to 
cardinal vowel /a/ 
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1 For cardinal vowel utterance /i/ (dg 2 3), the pitch period delay is equal to 84 
samples Thus, the projected trajectory ot fig 2 3(c) contains slightly less than 
5 loops Figure 2 3(b) shows the presence of only one prominent frequency 
in the spectrum This accounts for the approximately circular shape of the 
reconstructed trajectory in fig 2 3(c) Note that the trajectory never exactly 
repeats itself. It is felt that the tracking ot this minor variability even m periodic 
and quasiperiodic utterances is important tor reproducing the naturalness of 
speech 

2 For cardinal vowel utterance /o/ (fig 2 4), the pitch period delay is equal to 89 
samples. There is a prominent frequency content at approximately double the 
pitch frequency which causes a looping in each cycle ot the trajectory (hg 2 4(c)) 

3 For cardinal vowel utterance /a/ (fig 2 5), there are comparatively more number 
of prominent frequencies m the spectrum The pitch period delay is equal to 
90 samples. There is a prominent frequency at approximately 5 times the pitch 
frequency Thus, within each of the four major loops of the projected trajectory 
(fig 2 5(c)), the three smaller loops are approximately one-fitth the length of the 
major loop. 

4 For fncative utterance /?/ (fig 2 6), the projection ot the reconstructed trajectory 
(fig 2 6(c)) appears to be "more complex" than that corresponding to cardinal 
vowels However, reconstruction using the redundancy criterion (hg 2 6(d)) 
appears to do a better job than the SVD criterion in this case We will comment 
further on this in the next sechon 

5 For approximant utterance /j/ (fig 2 7), the pitch period delay is equal to 142 
samples At 16 kHz sampling rate, this corresponds to a frequency ot 1127 
Hz The 3 major loops in the projected trajectory each correspond to a pitch 
period There are two loops in each major loop corresponding to the prominent 
frequencies at approximately double the pitch frequency 

In section 2 3, we discuss another reconstruction scheme that is hrmly grounded 
in theory We will also make some comparisons between the two reconstruction 
schemes in the next section. 
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x(t) 

(d) 


Fig. 23: For cardinal vowel utterance /i/, (a) part of the time series at sampling rate 
1/r = 16kHz, (b) the first 100 points of the corresponding 400 point tourier 
spectrum, (c) the projection of the reconstructed trajectory using SVD criterion 
on the 1-2 plane corresponding to the 2 largest singular values, (d) trajectory 
plot on the (x(t),x(t-yTj)) plane where Tj is estimated using minimum mutual 
information analysis 
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x(t) 

(d) 


Fig 2.4: For cardinal vowel utterance /o/, (a) part of the time senes at sampling rate 
1/T = 16kHz, (b) the Erst 100 points of the corresponding 400 point tourier 
spectrum, (c) the projection of the reconstructed trajectory using SVD criterion 
on the 1-2 plane corresponding to the 2 largest singular values, (d) trajectory 
plot on the (x{t},x(f + Td)) plane where Td is estimated using minimum mutual 
information analysis 
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Fig. 2.5. For cardinal vowel utterance /a/, (a) part of the time senes at sampling rate 

(b) the first 100 points ot the corresponding 400 point tourier 
spectrum, (c) the projection of the reconstructed trajectory using SVD criterion 
on the 1-2 plane corresponding to the 2 largest singular values, (d) trajectory 

plot on the (x(f),x(f + Td)) plane where Tj is estimated using minimum mutual 
information analysis 
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x(t) 

(d) 


Fig. 2.6: For fricative utterance/?/, (a) part of the time series at sampling rate 1/T - 16kHz, 
(b) the first 100 pomts of the corresponding 400 point fourier spectrum, (c) 
the projection of the reconstructed trajectory using SVD criterion on the 1-2 
plane corresponding to the 2 largest singular values, (d) trajectory plot on the 
{x{t),x{t + Tti)) plane where is estimated using minimum mutual intormation 
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x(t) 

(d) 

Fig. 2.7; For approximant utterance /]/, (a) part of the time sencb at sampling rate 1/T - 
16kHz, (b) the first 100 points of the corresponding 400 point touner spectrum, 
(c) the projection of the reconstructed trajectory using SVD criterion on the 1-2 
plane corresponding to the 2 largest singular values, (d) trajectory plot on the 
{x{t),x{t + Td)) plane where Td is estimated using minimum mutual information 
analysis 
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2.3 Optimal State Space Reconstruction using 
Redundancy Criterion 

This method is based on the requirement of general independence of the scalar 
variables of the reconstructed state space as compared to that of uncorrelatedness 
of the SVD method As an illustration, first consider the case of 2-dimensional 
reconstructed vectors, 

x^(f "f" tT) = “i" ■!“ 4" 2^)]^ 

or xf = [a;, x^+k]'^ (2 9) 

1 e , n = 2 in eq (2 4) The parameter of interest is the time delay Td = kT, in seconds, 
between the two variables If Td is chosen to be very small, the two vanables would 
convey nearly the same informahon and the reconstructed trajectory would he close 
to the diagonal = Xt+Jt It is suggested that Td is chosen as the first local minimum 
of the mutual informahon between them [46] 

The mutual information between two random vanables Y and Z is a funchonal 
defined by their joint probability density 

where pY and pz are the probability density functions of Y and Z respechvely, pyz 
is their joint probability density funchon and < > denotes the ensemble average 

The logarithm is generally taken in base 2 so that the mutual information can be 
expressed in bits It can be shown that I{Y,Z) = I{Z,Y) Also, I{Y,Z) >0 If 
Y and Z are mdependent, le, pY,z = Py pz then 1{Y,Z) =0 If on the other 
hand, ^ determines y exactly, then I(Y,Z) = oo. Thus, mutual informahon between 
two variables is a measure of the general dependence between them In the above 
example, the variables are time delayed versions of the same scalar observable It 
answers the question, "Given x{t), how many bits on the average can be predicted 
about x{t + Td)7" Since we have a deterministic hypothesis on the data, strictly 
speaking, the mutual information between the two observables should be infinite 
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The divergence is avoided due to the finite accuracy of measurements and the use 
of finite partitions of the state space which are reflected in the result 

The above concept of mutual information can be generalized to n random 
variables The redundancy between n random variables ,yji can be defined 

as a functional of their joint probability density [45], [44] 


PYuY 2 , ,r„ 


PYi Py„ 


= PYu XniVU ,2/n)log2(^ 

Jy„ \PY, 


PYi, ,y;(yi, .Un) 

{yi) PY^yn) 


dij\ 


dl/n 

(2 11 ) 


with the standard interpretation of the symbols It is possible to extend the min- 
imum mutual informahon type of analysis to determine good reconstructions in 
n-dimensions using the concept of redundancy From eq (2 4), treating the n scalar 
variables as random variables, we have 

y; = x{t + (z - i)kT) 

= s(t + (z - l)Td), z = 1, ,n, (2 12) 


where the time delay = kT In this case, let us denote the redundancy in compact 
notation by where n is the embedding dimension and Td, the time delay 
Redundancy gives a quantitative measure of the degree of dependence between the 
n time delayed scalar variables It is also helpful to define a quantity called marginal 
redundancy, 

B:p^ = Iix{t + nTd),x^{t)) 


on -hi y?n 

- 



\ Fx (£ friTd ) / / 


(2 13) 


This tells us how many bits of the last component ot x'‘*‘(t) can be predicted, on 
the average, from the previous n components 

The idea of a good reconstruction is that a point on the reconstructed state space, 
x”(t), provide as much useful information about the original state space point s(t), 
as possible (see eq (2 3),(2 4)) That is, the conditional probability density Ps(fj/x"(o is 
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as sharp as possible Since the state space corresponding to the original dynamics 
IS generally not accessible in experiments, the above criterion can heuristically be 
changed to the requirements that the conditional density Px(t+nr<,)/x’>(t) be "as sharp 
as possible" and the scalar variables of x^(t) be "as independent as possible" [107] 
A suboptimal algorithm to implement the above criteria may be arrived at in two 
successive steps 

1. Choose an embeddmg dimension n so as to maximize the marginal redundancy 
This ensures that Px{t+nTd)/x’'{t) is "sharply peaked". 

2 Choose the appropnate time delay Td as the first local minimum of the plot ot 
redundancy versus tune delay Tj 

This algorithm is somewhat at variance with that origmally proposed m [45] In 
that paper, it is argued that, with increasing scalar measurements, new information 
about the original system dynamics may get shifted from macroscales to microscales 
and, hence, be irrelevant Based on this argument it is stated that redundancy by 
itself does not represent the useful mformation about the dynamics For example, if 
the dynamics is chaotic, 

n 

= fJpx((„-or,) (2 14a) 

hm BJr=0 (2 145) 

Td-^oo ^ 

Thus, minimizing the redundancy would mean choosing the largest possible value 
of Td However, this is not an appropriate choice because it relates disparate length 
scales in the original system and the reconstructed dynamics As a remedy, a statistic 
IS suggested in [45] that discounts for useless information regarding small scale 
features However, as a good first approximation, we endorse the choice of Td as 
the first local minimum of the redundancy plot, as argued in [95] 

We have used the minimum mutual information criterion to find the optimal 
time delay Td for plotting 2-d reconstructed trajectories in state space A similar 
study using the redundancy criterion on sustamed vowel utterances [a ,e ,i ,o ,u ] by 
3 male speakers is reported in [14] Our analysis was done on the same database as 
used in section 2 2 The mean time delay across 4 speakers varies from 0 19 ms for 
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tricatives /X/ and /h/ to 1 01 ms for nasal The time delay Td over 44 consonants 
of the IPA (excluding the 13 plosives) across 4 speakers is obtained as 0 56 ± 0 27 ms 
Figures 2 3-2 7, part (d) show the 2-d reconstructed trajectories in the x(f) - x{t + Td) 
plane for three cardinal vowels /i/, /o/ and /a/, one fricative /}/ and one approximant /j/ 
respectively The following observations of this analysis and reconstruction scheme 
are noteworthy: 

1 The mean time delay for the 8 Daniel Jones cardinal vowels spoken 4 times each 
is 0 55 ms while that for 22 fricatives across 4 speakers is 0 42 ms The difference 
m Td between cardinal vowels and fricatives is intuitively expected because the 
regular nature of the former allows it a greater predictability time compared to 
fricatives 

2 The variation in Td across four repeated utterances of the same cardinal vowel 
and across four speakers of the same phoneme is relatively small The variation 
m Td IS less than 0 13 ms across 4 utterances of the same cardinal vowel tor 7 ot 
the 8 vowels. Similarly, the variation m Td is less than 38 ms across utterances 
of the same phoneme by 4 speakers (3 males and 1 female) in the case ot 33 ot 
the 44 phonemes of the IPA that were analysed. 

3 A comparahve study of the reconstructed trajectories of the relatively regular 
cardinal vowels (fig 2 3-2 5, parts (c) and (d)), shows that there is little to 
distinguish between the two reconstruction schemes in this case However, in 
the case of more complex fricatives, the redundancy criterion appears to give 
consistently better reconstructions as far as visual comprehension is concerned, 
compared to the SVD cnterion This fact is better appreciated if one observes the 
evolution of the reconstructed trajectories, say, on a computer screen We see in 
fig. 2 6(d), for fricative utterance /?/, the reconstructed trajectory reproduces the 
periodicity corresponding to the prominent frequencies in the founer spectrum 
of the utterance This periodicity will be visible in the SVD method also (hg 
2 6(c)) if we view along the 1-axis rather than orthogonal to the 1-2 plane 

The SVD method chooses a good reconstruction from a larger set ot possible 
reconstructions compared to the mutual information method This is because the SVD 
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method gives an optimum basis set for reconstruction in addition to an appropriate 
time delay It also has an inbuilt ability for noise reduction. However, the mutual 
information method, which uses the criterion of statistical independence gives better 
reconstructions for specific examples [44] This is, however, not so in general [26] 

After reconstruchng the trajectory from a scalar time series, we will now discuss 
the method of Lyapunov exponents which is used to characterize the steady state 
behaviour of dynamical systems We will also give results of the estimation of the 
largest Lyapunov exponent from reconstructed speech trajectones 

2.4 Lyapunov Exponents 

Lyapunov exponents (also called characteristic exponents) give a coordinate indepen- 
dent measure of the local stability properties of a trajectory If the trajectory evolves 
in an n-dimensional state space, then the characterization is done through n ex- 
ponents, usually arranged in decreasing order, and referred to as the "spectrum 
of Lyapunov exponents" Stability characterizes response to perturbations and can 
refer to individual points (local stability), to trajectories (asymptotic local stability), to 
famihes of trajectories (attractors) or to an entire dynamical system (whether there is 
a unique attractor or not). Conceptually, Lyapunov exponents are a generalizahon of 
eigenvalues used in charactenzing different types of equilibrium pomts They cate- 
gorize bounded trajectories into equilibrium points, periodic solutions, quasiperiodic 
solutions or chaotic solutions The first three types of steady state trajectories are 
called regular trajectories They are asymptotically locally stable and are characterized 
by a spectrum of nonpositive Lyapunov exponents An equilibrium solution has a 
Lyapunov exponent spectrum of the form (-,-, ) A limit cycle (example of a 

periodic solution) has a spectrum of the form (0, — , — , ), while a two-torus (example 

of a quasiperiodic solution) has a spectrum of the form (0,0, -, -, ) In contrast, a 

chaotic solution which is asymptotically locally unstable, is defined by the presence 
of at least one positive Lyapunov exponent. 

2.4.1 Theory and Evaluation from Scalar Time Series 

Let us first consider the case where the dynamical equations are explicitly known 
[36], [50] We take a discrete time dynamical system 

Sfc+i = f(sfe) (2 15) 

„ZNTRAL LiBRARIf 

I I. T., KANPUR 

“ — I — f ' V 
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where f i?" i?™ is a differenhable vector hinction Consider the growth ot an 

arbitrary smaU perturbahon Aso about a point sq The one step evolution ot this 
perturbation can be approximated as follows. 

Si = f(so) 

Si ~1~ Asi = f(So "h Asq) 

Asi ~ (ds^f) Aso 

= J, Aso (2 16) 

where J, is the Jacobian of f Each of the eigenvectors of J, locally grow at a rate 
where Aj,j = 1,2, ,n are the eigenvalues of the matrix The growth ot the 
perturbation after i steps is given by the iterate f ot f, and can be approximated 
by its Jacobian, J* Using the chain rule, 

As, ~ J‘ Aso 

= (ds,f )Aso 

= (ds...f)(d,..,f) (d^fJAso (2 17) 

The Lyapunov exponents are given by the asymptotic growth rate of the eigenvalues 
of Jl 

Aj = hm -log||J^||, j = 1. ,n (2 IM) 

1-+00 t 

The existence of the limit is assured by the mulhplicative ergodic theorem ot Oseledec 
[106] The above definition can also be extended to continuous time dynamical 
systems 

Lyapunov exponents can thus be calculated by linearizing the system ot equations 
and applying it to small perturbations about a given numerical solution When the 
system equations are not known, it is possible to get the Lyapunov exponents 
from a scalar observable of the evoluhon of the dynamical system This consists ot 
reconstructing the state space, following the evolution of the reconstructed trajectories 
that are close to each other and fitting the data in local neighbourhoods to approximate 
J] The reader is referred to [35], [119] for algorithmic details While in those papers, 
linear maps are used, a later work [22] shows that it is advantageous to use higher 
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order Taylor series for local mappings We have, however, used a different algorithm 
for compuhng the largest Lyapunov exponent from speech time series. This is 
because the algorithm is more robust to changes in input parameters In contrast, 
the previous algorithm has been found to yield Lyapunov exponent estimates that 
vary considerably with the embedding dimension [147] To discuss this algorithm, 
let us consider a more formal definihon of Lyapunov exponents [63] 

Let f J?" — > define a discrete dynamical system Assume f is Fix 

s e Let f(s) denote the zth iterate of f Suppose that there are subspaces 
3 D D in the tangent space of f (s) and numbers Ai > A 2 > > A„ 

that depend on s with properties that 

(a) = 1 = 0,1, ,J = 1,2, ,n (2 19a) 

(h) dimV^^^ = n + l- j, i =0A, ,J = 1,2, ,n (2196) 

(c) hm„_oo k log I19sf*(s^)ll = A„ for all G lIsJH = 1 (2 19c) 

then Aj,j = 1, ,n are the n Lyapunov exponents of f 


As an example, consider for n = 2, 


dsf = 


ai 

0 


0 

a2 


(2 20 ) 


where, s is a fixed point and ai > a 2 . Pick FO) = span[(0, 1), (1, 0)} = R^, and 
F(2) = span[(0,l)} Then = F(2),t = 0,1, satisfy (a) and (b) 

Also, Aj = logaj,j = 1, 2 In general, consists of all vectors which grow 

at the fastest possible rate, - fJ^^ consists of all vectors which grow at the 
next fastest rate and so on. Alternatively, Fg^'^ is the direct sum of eigenspaces 
corresponding to At,i = 1, ,n is the direct sum of eigenspaces corresponding 
to Ai,t = 2,3, ,n, where Ai > A 2 and so on [10] Smce dtmV^^^ = n, if one chooses 

the vector s-' in (c) "randomly", it is most likely to lie in which is of full 

Lebesgue measure. This vector will evolve at a rate given by the largest Lyapunov 
exponent. Similarly, an area element defined by three points in the state space 
would evolve at a rate proportional to A d-dimensional volume element 

would grow at a rate depending on the first d eigenvalues This gives a method 
for estimating the Lyapunov exponents If the system equations are known, then 
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one can monitor the evolution ot a set ot small, initially orthogonal vectors (called 
principal axis vectors) about a central trajectory (called the hdiaial trajectory) The 
growth rate of a single principal axis vector gives Ai, the growth rate ot the area 
spanned by two vectors gives Ai +• A> etc. The exponential dominance ot the largest 
eigenvalue requires that the principal axes be periodically leorthogonali/ed so that 
they remain resolvable in the finite precision ot computers 

The above discussion suggests a procedure to evaluate the largest Lyapunov 
exponent from a scalar observable ot a dynamical system Lor algorithmic details, its 
implementation aspects and justifications, see [149] 1 lere, we conhne to enumerating 
the steps of the algorithm Given time series r,, i - I, , N, tollowmg are the steps 
(see also fig 2 8) 

STEP 1: Reconstruct the trajectory in n-dimensional state space using time delay 
embeddings 

x.'^ = [x,x,+k I 1, ,N in 1}L (2 4) 

STEP 2: Start the algorithm by locating the nearest neighbour x", whose distance 
is greater than a prefixed distance parameter SCALMN, to the initial point on the 
fiducial trajectory x" Let = ||x" - x)‘J| SCALMN ensures that distances below 
the noise scale are not used in the computation 

STEP 3; Set = ilxj+g — x((^^||, where q is the prefixed number ot steps tor which 
the pair of points are to be evolved Store L = 

This completes the initial run Now enter the main loop 

STEP 4: A replacement point xj*^ is found as follows The distance between the 
replacement point x" and the fiducial point x(\^, i e , dj"’ Hx)'. ^ - x" || is the smallest 
distance greater than SCALMN, and less than a prefixed parameter SCALMX to ensure 
that the mitial principal axis vector is not large Also, the angular orientation 
between the replacement vector and the evolved vector is less than a prefixed value 
ANGLMX 

STEP 5: If no points satisfy the criteria in STEP 4, relax the parameter SCALMX and 
repeat STEP 4 until such a point is found 
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STEP 6: Set 
Repeat steps 


4"’ = Store i, = 4'>/4‘’ 


4-6. In general, for the iteration. 



(2 21tt) 


(2 216) 

1 = 

(2 21() 


Continue bil i = N - {n - l)k = N' (say) The estimate ot the largest Lyapunov 
exponent is then given by 



That this definition indeed gives the largest Lyapunov exponent uin be seen 
from the fact that the choice of replacement vectors come irom the tull measure 
set It IS necessary to periodically replace the principal axis vector so as 

to prevent it from gomg through a global "told" when we are only interested in 
averaging the local "stretch" The parameters that need to be fixed before running 
the algorithm are the embedding dimension n, time delay S('AI MN, SCALMX, 
ANGLMX and the number of evolution steps q Although fixing the parameter 
values requires some experience with the algorithm, the estimate is usually robust 
to a reasonably large range of parameter variation 

Estimating negative Lyapunov exponents from time series is difficult because it is 
often impossible to resolve information about the contracting state space directions 
Volume elements involving negative exponent directions collapse exponentially fast 
For most applications, finding the largest Lyapunov exponent is sufficient because 
a positive value indicates the sensitive dependence of trajectories on the initial 
condition 


2.4.2 Results from Speech Signal and Some Comparisons 

We have used the above algorithm to estimate the largest Lyapunov exponent from 
phonemes The speech database and the ADC parameters are as given in database 1 
(Appendix B) Again, we excluded plosives from this analysis because ot the highly 
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nonstationary nature of any meaningful length of the corresponding time series The 
exponent was computed for all the other phonemes using the algorithm described 
in section 2 41 The algorithm was run several hmes on each time series using 
different sets ot parameter values The results obtained are fairly stable over a large 
range of parameter values The data length used for computation is N = 2000 In all 
instances, the Lyapunov exponent converged in far fewer iterations than afforded by 
the data length Figure 2 9 shows the convergence of the largest Lyapunov exponent 
as a funchon of the number of iterahons for two phonemes — /6/, Ai = 4161 
and /§/, Ai = 2038 ls“^ The time series in each case consists of 2000 samples An 
evolution step, q = 5 per iteration allows approximately 400 iterations 

The largest Lyapunov exponent gives an indication of the steady state behaviour 
of the dynamical system The results of the largest Lyapunov exponent estimation 
are summarized in Table 2.1 The error values indicate the standard deviation (s d ) 
of the mean value of the Lyapunov exponent per phoneme type over four speakers 
In the case of cardinal vowels, the s d is over four utterances by a single speaker 
The positive values give evidence of the exponential divergence of nearby trajectories 


Phoneme 

Type 

No. of 
Phonemes 

Largest Lyapunov 
Exponent (s"^) 

Cardinal Vowel 

8 

3482 2 ± 182 5 

Nasal 

7 

2198 9 ± 700 3 

Trill 

2 

3742 7 ±1382 4 

Tap or Flap 

2 

4469 9 ± 1009 4 

Fricative 

22 

2597 3 ± 854 9 

Lateral Fricative 

2 

2105 9 ± 774 3 

Approxmant 

5 

3345 4 ±1191 1 

Lateral Approxmant 

4 

3247 2 ±725 7 

Mean = 2899 0 


Table 2 1: Mean value of the largest Lyapunov exponent for various phoneme types 
The error values indicate the s d of the mean per phoneme type over 4 speakers 
In the case of cardinal vowels, the s d is over 4 utterances by a single speaker 
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To place more ta.th m the algorithm and result-., we 
with known dynamics and used the same algo.ithm to ec 


generated three time series 
aluate the largest Lyapunov 


exponent The three equations are 

= 10sm2r)07rt 

y,(t) = 3 sin 250714 f 7 sin rrt i 4 mu125.t4 
y^{t) = (sin 50714) (sin50yil!T4) 


(223a) 
(223b) 
(2 23c) 


edt) and y^{t) are periodic with one and three trequencies respectively while viW - 
,s quastpenodtc with two incommensurate trec,uencies Ihe three '--‘-ms 
sampled at 16 kHz to produce three time senes .lO.ViU, cdH,. - h . 000. 
IdeaUy, all the three time senes should produce zero Lyapunov exponents Using 
the above algorithm, the computed exponents aie L.lh t hl-l. ih ) I U'> a" 
331 ±38 2 respechvely The error values show the variation with respect to the 

algorithm parameter values 

To compare the results with those of speech signals, we now mimic the conditions 
^ = 1^2,3, under which the speech signals were analysed Ihe three tune 
senel are multiplied by appropriate gam tactors and quantized to 12 bits/samples The 
exponent for the three quantized time series are obtained as 1 1 7 1 2M) *-,51 0±41 
and 294 0 i 89 3 respectively Next consider the etfcct ot observational noise Noise 
with uniform density is generated using a one step whitening tilter and added to the 
time series prior to quantization Five cases of noise s d equivalent in magnitude 
to the lower 0 to 4 bits respectively are considered The corresponding largest 
Lyapunov exponent of the three time senes are shown in fig 2 10 It is seen that 


the 0 xpon 8 nt value increases with noise variance 

In all instances of speech signal analysis, the Lyapunov exponent is greater tha 
600s'’- For synthesised data, barring the unrealistic instances of noise s d equal to 
and 4 bits, the exponent is significantly less than 600 s and is occasionally negative 
Moreover, the mean for speech signals is one to two orders of magnitude greater 
than that for computer generated periodic and quasipenodic time senes, even in 
presence of significant additive noise in the latter cases Hence, Lyapunov exponent 
analysis shows that reconstructed speech trajectories can be distinguished from 
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Fig. 2.9: The convergence plot of the largest Lyapunov exponent as a function of the 
number of iterations for two phonemes 



Fig. 2.10: The largest Lyapunov exponent for 3 time series generateci from eq 2 23 (a)- 
(c) as a function of the variance of additive noise The bottom to top plots 
correspond to eq 2 23(a), (b) and (c) respectively 
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periodic and quas.periodic behaviour in that nearby speech tra|ectories exhibit expo- 
nential divergence on the average The relation between Lyapunov exponents and 
other dynamical invariants, namely dimension and metric entropy, will be discussed 

in the next chapter 



Chapter 3 

Dynamical Analysis of 
Speech Signals— 2 


Dimension and metric entropy are two important parameters characterizing the 
behaviour of dynamical systems The state space reconstruction theorems of Takens, 
discussed in chapter 2, allow us to get an estimate of these "mvanants" from a scalar 
observable of the time evolution of a dynamical system This becomes important 
when one wants to build nonlinear state space models of observed phenomena In 
this chapter, we review the definitions of dimension and dynamical entropy and their 
estimation procedures from hme series data We also give results of our estimation of 
these mvanants for unit speech utterances, namely phonemes, and draw conclusions 
from them Towards this end, the chapter is organized as follows In section 3 1, 
we dwell qualitatively on the nohons of dimension and entropy A discussion of 
the concepts of invariant and natural measures of a dynamical system m section 3.2 
builds a ground for the definitions of dimension and dynamical entropy in sections 3.3 
and 3 4 respectively We also consider the estimation of these dynamical invariants 
from the generalized correlation sum m these respective sections Based on the 
developments in sections 3 3 and 3 4, we review a method for the joint estimation 
of the correlation dimension and second order entropy from time series in section 
3 5 Section 3 6 deals with the estimahon of correlahon dimension from speech time 
series We give results of the numerical computahon from speech time series as also 
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from a simplified model of a specific vowel utterance In section 3 7, we discuss some 
implementation aspects of the correlation sum algorithm used in the estimation of 
the invariants We also consider how various sources of error in estimation affect the 
dimension results Finally, in section 3 8 we present the results of the computation 
of second order entropy from speech time series and draw conclusions from them 

3.1 Notion of Dimension and Entropy 

Various definitions of dimension and entropy are used to characterize, both theo- 
retically and experimentally, the steady state behaviour of dynamical systems The 
notions of dimension and entropy have been given a wider interpretation in the 
last few decades to quanhfy dynamical systems behaviour, and specially the complex 
behaviour of chaotic dynamics 

Dimension is one of the basic properties associated with the geometrical descrip- 
tion of an object An unqualified reference to the dimension of an object is usually 
made to the Euclidean space in which it resides Dimension is thus associated with 
the "mmimum number of independent directions" of the space containing the object 
One can also look upon the dimension of an object as the exponent according to 
which its "volume" or "bulk" scales with the resolution For example, the volume of a 
cube scales as the third power of its side In dynamical systems literature, dimension 
IS used to specify the number of degrees of freedom that a system possesses [137], 
[37], [61], [50] It IS useful here to distinguish between nominal and effective degrees 
of freedom Nominal degrees of freedom refers to the number of state variables 
needed to describe the dynamical system Although there may be many nominal 
degrees of freedom, the dynamics may settle down on or approach a subset of the 
state space, called the attractor Attractors exist only for dissipative dynamical systems 
and can be any one of four types, namely, fixed point, limit cycle, q-torus in the 
case of quasipenodicity and strange attractor in the case of chaotic dynamics (see 
Appendix B and references therein) The dimension of these objects on which the 
steady state dynamics settles down gives the number of effective degrees of freedom 
The dimension of the first three types of attractors are integral numbers Chaotic 
dynamics produces complex trajectories which asymptotically settle down on strange 
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attractors These attractors are usually fractal or self-similar and the scaling exponents 
of their appropriate measures with respect to the resolution are fractional or non- 
integral Thus, strange attractors generally have nonintegral dimension and various 
definitions exist to precisely give the number of effective degrees of freedom Given 
a time senes as a scalar projection of a state space trajectory, dimension analysis can 
be used to quantify its complexity It gives the necessary and sufficient number of 
state variables needed to approximate the steady state dynamics from which the hme 
senes is realized. Since dimension calculation from time senes gives the number 
of effective degrees of freedom m the underlying dynamics which is usually much 
smaller than the number of nominal degrees of freedom, a phenomenological state 
space modelling scheme of the time series can inherently elimmate any redundancy 
of vanables 

Kolmogorov or metric entropy is another important measure of dynamical sys- 
tems by which their time evolution can be characterized [125], [50] Historically, 
entropy has thermodynamic ongins where it is used as a measure of the disorder in 
a given system The increase m disorder of a system is associated with an mcrease in 
the uncertainty of the state of the system at a given instant of hme In informahon 
theoretic terms, entropy of a system denotes the average uncertamty or the amount of 
information required to locate the system in a specified state Metric entropy quanh- 
fies the "degree of chaos" of a dynamical system When a dynamical system is under 
observation, each new measurement is a potential source of additional informahon 
about the system For example, in the case of fixed pomt behaviour, no addihonal 
information is obtained after a single measurement m the steady state Similarly, 
in the case of periodic and quasipenodic behaviour, no additional informahon is 
obtained after a finite number of measurements In contrast, a chaotic dynamical 
system continues to produce new informahon with each succeeding measurement 
The metric entropy, K, is an asymptotic measure of the average rate of informahon 
produchon as a function of hme Thus, itT = 0 for situations of regular behaviour 
(fixed point, limit cycle, quasipenodicity), 0 < IT < oo for chaotic behaviour and 
iif = oo for truly random behaviour When the laws governing the dynamical system 
or a phenomenological state space model of the observed time series is available. 
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metnc entropy can also be looked upon as the asymptotic rate at which intormahon 
about the initial condition driving the system is lost In other words, metric entropy 
IS propothonal to the inverse predidabthty time of the evolution of a dynamical system 
from a given inihal condition 

The theory of state space reconstruction discussed in chapter 2 allows us to 
reconstruct trajectories in the state space from a scalar time series (or ohservahk) 
of the dynamical system evolution such that the dynamical invariants evaluated 
from the reconstructed trajectory are the same as those of the underlying dynamical 
system This means that it is possible to estimate dimension and metric entropy 
(also Lyapunov exponents, Chapter 2) of a stahonary vocal tract configurahon from 
the corresponding speech signal In the next section, we discuss the concepts ot 
invariant and natural measures of dynamical systems These are relevant in the 
context of the definitions of dimension and dynamical entropy 

3.2 Invariant and Natural Measures of a 
Dynamical System 

The characterization of dynamical systems behaviour through dimensions and entropy 
can be done either from a physical or phenomenological model of the dynamics or 
from an observation of its time evolution. This is because dimensions and entropy 
are estimated from the invariant probability measure of the dynamical system which 
can be analyhcaUy obtained from the physical model of the dynamics or estimated 
from its time evolution by invoking ergodicity [36], [37], [137], [132] To discuss 
this further, let us consider a discrete time dynamical system f M M, where M 
(usually BJ^) is the state space on which the map evolves from an initial condition 
So 

Si+i=f(s,), 1 = 0,1, (3 1) 

In the case of continuous hme dynamical systems, assume that a corresponding 
discrete time map f exists by discrehsing time or Poincare sectioning [125] In 
experimental situations, one can only observe the time evolution of one or at most a 
few variables of the dynamical system f Consider a situation where just a scalar time 
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series cci, i = 0, 1, is observed through the smooth observable function h M R 
on the dynamical system That is, 

= /i(f(so)), ^ = 0,l,2, (3 2) 

One can talk of dimensions, metric entropy and Lyapunov exponents of a single 
trajectory of a dynamical system However, it is more meaningful to associate 
these terms with a statistical descnption of the dynamics In regular and chaotic 
dynamical systems, there exist sets of initial conditions which give rise to families 
of trajectories having the same statishcal properties, for example, when different 
trajectories of a dissipative dynamical system asymptotically settle on the same 
attractor A geometrical description of attractors, particularly strange attractors, 
presents mathematical difficulties Shifting the attention to a description through 
invariant measures simplifies the proceedings 

An invariant measure is one that does not change under the action of the dynamics 
That IS, a measure p is invariant under the map f, if for any subset S of M which 
IS in the support of p, 

p{S)=p{r\S)) (3 3 ) 

for all 1 > 0 Here, f'‘(5) = {s G M f*(s) G S} In other words, a measure 
of a set IS invariant if it is equal to that of the sets mapped mto it Lebesgue 
measure, which is one of the most common measures, is not invariant with respect 
to dissipative dynamical systems because state space volumes contract under the 
action of the dynamics Therefore, one has to look for other physically relevant 
invariant measures 

Although a dynamical system may have many mvariant measures, not all of them 
are relevant because they may not be physically observable Of particular importance 
IS the (computer generated) time evolution or experimental observation of dynamical 
systems Operationally, they appear to produce well defined time averages This 
gives us a selection procedure for a measure p through ergodicity. Such a measure 
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IS called the natural measure of the dynamical system which is invariant with respect 
to the dynamics Thus, we have 


N-l 




(3 4) 


1=0 


where /s(s) = 


1 if s G S' and 0 otherwise Also, 

^ ^ w ^ elf* (so)) 

jY-voo N ^ iv-^ocy N 


L=0 


1=0 


= [ g{l{S))dii{S) 

JM 


(3 5) 


for a continuous function g(s) over the map f of the underlying dynamical system 
A dynamical system may have many natural measures corresponding to different 
families of trajectories In a dissipative dynamical system, tor example, each attractor 
will have its own natural measure For an attractor, the natural measure is dis- 
tmguished from other measures because it is stable to low level dynamical noise 
That is, if the deterministic dynamical system is replaced by one with some additive 
dynamical noise, as long as the noise is small enough, the natural measure of the 
noisy system will remain similar to that of the deterministic system 

The natural measure can be estimated numerically from a time evolution of the 
dynamical system using a histogram or kernel density estimation procedure 

The vocal tract system is a complex dynamical system and an accurate physical 
model must consider the effects due to time variation of the vocal tract shape, 
losses due to heat conduction and viscous friction at the vocal tract walls, radiation 
of sound at the lips, nasal coupling, the excitation function due to air flow from 
the lungs etc. [114], [124] These considerations suggest that an accurate physical 
model would be a nonlinear dissipative dynamical system which can possibly show 
chaotic behaviour The computation of dimension and metric entropy (also Lyapunov 
exponents) is a tricky business because of the time varying nature of the vocal tract 
and the simultaneous requirement of large data length and stationarity in the time 
interval m which the computation is done When we refer to dimension, metric 
entropy and Lyapunov exponents as "dynamical invariants", we assume that there 
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exists a stationary (or invariant) probability measure of the underlying attractor The 
numerical evaluation of these invariants from a single time evolution (or realization) 
of a dynamical system assumes ergodicity which is rarely proved even m those cases 
where a physical model of the dynamics is available The time varying nature of 
the vocal tract dynamics and the consequent nonstahonarity of the speech signal 
requires us to compute dimension and metric entropy from approximately stahonary 
segments of phoneme utterances. It is unportant to note that these parameters 
characterize the corresponding speech signal and their reconstructed trajectories 
Linking them to any underlying attractors is only hypothetical and not proved 

3.3 Definitions of Dimension and Relation with the 
Generalized Correlation Sum 

In diynamical systems theory, dimension is used to characterize the objects on which 
trajectories asymptotically accumulate As mentioned before, the intuitive notion of 
dimension is as a scaling exponent of the "bulk" or "volume" of the object with 
respect to the resolution, i e , 

bulk ~ {resolutianf^'^^'^^^^ (3 6a) 

Here, bulk can refer to the volume, mass or some measure of the object The 
definition of dimension is usually made in the form 

dimension = lim (3.66) 

resoluUcm—t^O iog TCSOlutlOTl 

In dissipative dynamical systems, the steady state motion takes place on or asymp- 
totically approaches an attractor which has Lebesgue measure zero. In this case, 
dimension is used to characterize the attractor on which typical trajectories accu- 
mulate, 1 e those which originate from a set of initial conditions having positive 
Lebesgue measure The relevant measure of the bulk or volume of the attractor or 
any subset 13 is the natural invariant measure p{A) or p{B) associated with it The 
normalization of this measure for any subset B with respect to the attractor A will 
be the corresponding probability that a random point on the attractor lies in i e , 
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For simple attractors characterizmg regular motion, any reasonable definition of 
dimension gives the same number Thus, a fixed point has zero dimension, a limit 
cycle has dimension one and a g-torus has dimension q More complicated geometrical 
objects like chaotic attractors often exhibit fractal properties [125], [8] There exist a 
variety of definitions of dimension (often producing fractional numbers) to describe 
these objects. Not all of them give the same number. In fact, the differences between 
these dimensions can be used to characterize the fractal structure of strange attractors 
These definitions of dimension can also be used to characterize regular attractors 

The ongm of these definitions of dimension can be traced back to Hausdorff 
who gave a completely rigorous definition of dimension in 1919 The Hausdorff 
dimension utilizes a purely geometrical description of the possibly fractal set and 
makes no reference to the measure p defined on the attractor [137], [39], [152] It 
is also not amenable to numerical estimates To define the Hausdorff dimension 
of a set A lying in a n-dimensional Euclidean space, consider a finite covering 
C{r,A) = ,Bk} of A with sets Bi,l = 1, whose diameters r, are less 

than r Define the quantity /^(r) by 




mf V 


C(r,A) 




(3 8) 


where the infimum (or minimum) extends over all coverings satisfying the constraint 
ri < r. Now let 

Id = lim ld{r) (3 9) 

r-~fO 

Then there exists a critical value of d above which Id = 0 and below which Id = oc 
This critical value, d — dfj, is the Hausdorff dimension 

The difficulty in estimating the Hausdorff dimension numerically is that an 
infimum over all coverings has to be taken If one relaxes this requirement and 
chooses a covering that uses fixed size boxes or a partition of side resolution r , then 
one obtams an upper bound to the Hausdorff dimension This is variously referred 
to as the capacity, the fractal dimension or the box counting dimension [137], [39], 
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'«('•) = E '■!' 
= E^^ 

i 

= n(r)r‘’ 


(310) 


where n(r) is the number of nonempty boxes of resoluhon r The capacity Dc is the 
value of d at the transition between 1^ = 0 to Id = oo. Taking Id ~ 1, gives 


or, formally. 


Dr = lim 




(3 11a) 


(3 116) 


'■-‘0 log r 

The definition of capacity only uses the information about the number of 
nonempty boxes It does not take into account the number of points in each 
box In other words, the underlying probability measure is ignored and a uniform 
probability for each box is assumed 

The generalized dimensions [66], [54], [112] were proposed to give a unified 
approach to the various definitions of dimension existing at that time and explain 
the differences in them Generalized dimensions take into account the probability 
measure of each box covering the attractor Let IT = [7ri,7r2, ,TrM{r)] denote a finite 
partition of side resolution r of the attractor A. Here, M(r) denotes the number 
of partitions of IT. Also let Pr,, = be the normalized probability measure of the 
partition tti, 1=1, ,M{r) The generalized dimensions Dq are defined by 


Dq = hm 

r— *0 ^ — 1 


M{r) 

log ( E PI 

\ 1=1 


(3 12) 


for any q ^ 1 The value for g = 1 is defined by takmg the limit as g — ^ 1 This is 
known as the information dimension [137], [39], and is given by 


Di = hm Dq 

q-^l 


= hm 

r-^O 


log p. 


(3 13) 
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Of particular importance is D 2 which is called the coTrelatton dttnenston [57], [58] It 
IS given by 

log (Efif’ 


£>2 = hm 

r-O 


logr 


(3 14) 


It IS one of the most popular dimension computation algorithms and can be easily 
used on experimental hme series data via the correlation sum as discussed below 

Similarly, it can be shown that Dq is the capacity or the fractal dimension 
For a uniform probabihty measure on the attractor, Dq is the same tor all g For 
a nonuniform attractor or multifractal, the variation in Dq with q quanhfies its 
nommiformity For example, 


Doo = hm 

r -*0 

D-00 = hm 

r -+0 


log (max; Prq) 
logr 

log (mm; Pn-,) 
logr 


(3 15a) 
(3 156) 


Dq is a nonmcreasing function of q 

There are other definitions of dimension that are useful in specific contexts 
The topological dimension is the dimension of the Euclidean space which the attractor 
resembles locally [70]. It is, by definihon, an integer number The pointwise dimension 
IS a local measure of the dimension [137], [39] It denotes, in the limit, the scaling of a 
measure of a box with its resolution around a local point on the attractor The average 
pointwise dimension is the corresponding global quantity given by an average of 
the pointwise dimension on the attractor Another dimension algorithm which gives 
an integer number even for fractal attractors is the local intrinsic dimension [47], [65] 
The basic idea is to compute the number of governing parameters in local regions 
The average local intrinsic dimension is the corresponding global average The k-th 
nearest neighbour dimension [39], [20] considers fixed mass boxes instead of hxed size 
boxes as in the case of generalized dimensions For strange attractors, there is a 
conjecture due to Kaplan and Yorke [78], that relates the dimension to Lyapunov 
exponents. Specifically, if Ai > A 2 > > Xj are the d Lyapunov exponents of the 

dynamical system, then the Lyapunov dimension D^ is given by 
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where j is the largest index for which the sum over k is nonnegative The conjecture 
IS that Di IS equal to the informahon dimension 

Let us now consider the issue of numerical estimation of the generalized dimen- 
sions from time series data The theory of state space reconstruction, discussed in 
chapter 2, gives a method for reconstructing the attractor from a scalar observable 
of the dynamics Consider a time series x,, i = 1, ,Ar' obtained from the time 
evolution of a dynamical system f through the smooth scalar observable function h 
A hme delay reconstruchon m produces a vector time series 

~ [^t ®t+(a-i)i:] , i = 1, , iNT (3 17) 

for an integer delay k and embedding dimension d and N = N' - {d- 1)A: If m is 
the dimension of the manifold (generally unknown aprtori) on which the dynamics f 
evolves, then d > m is a necessary condition and d > 2m -I- 1 is a sufficient condition 
to ensure that the reconstructed vector time series is an embedding Therefore the 
dynamical invariants evaluated from the dynamical system through s^, z = 0, 1, , 

eq (3 1), or the reconstructed time series xf , z = 0, 1, , eq (3 17), are the same 

This IS particularly useful when a physical model of the dynamical system is not 
available 

A direct method for the numerical estimation of generalized dimensions is to 
estimate the in eq (3 12) by counting the fraction of the total number of trajectory 
points xf , i = 1, ,N in partihon tt/, Z = 1, , M{r) However, a major disadvantage 

of this method is the requirement of large data length N which rapidly becomes 
impractical with increasmg embedding dimension d [62] One way of reducing 
the effect of these limitations is to estimate the dimensions numerically from the 
generalized correlation sum [125], [50], [110] As a special case, the computation 
of correlation dimension D 2 from the correlation sum has become a most popular 
algorithm for dimension calculation 

There is an important approximation that relates the generalized correlation sums 
to YlfiV P 12.) Let us consider this approximation and the subsequent 

relation of Dq to the generalized correlation sum Let B^d{r,d) represent a ball of 
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radius r around the point on the reconstructed trajectory of embedding dimension 
d Also, let 




pi[B^a{r, d)] 

p[A] 




(3 18) 


For a point on the reconstructed trajectory, By^d{r^d) can be estimated from 

B^(r,<i) = -^5^e{r-i|xf-xf||) (319) 

t=l 

where Q{arg) = 1 for arg > 0, and 9{arg) = 0, otherwise The choice of the metric 
in eq. (319) determmes, for example, whether the ball is a d-dimensional 

sphere (L 2 norm) or a cube {Loo norm) However, the dynamical invariants are 
immune to the choice of the metric. Let n = [tti, ,7rv/(r)] be a finite partition of 
side resolution r of the attractor A where M(r) is the number of parhtions and let 
be the probability of xj, 1 = 1, ,M{r) Also let be the probability 

of that partition x, which contains the jth point of the trajectory If this partition 
contams points of the trajectory, then the important approximation is 

B^4r,d) ^ 

= ^.,0, (3 20) 

The idea behind this approximahon is that most points in x/^^j will be within r of x^ 
and although some points that are farther away will be ignored, others that are close 
enough but m neighbouring partihons will be counted The error in approximation 
is only a factor of order unity and certainly not larger than the number of nearest 
neighbour boxes [50] 

The generalized correlation sum is given by 

J~l 
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The foUowmg derivation shows the approximation between the generalized corre- 
lation sum and the probability of partihons It proceeds by substituting for 
B and replacing the sum over points on the reconstructed trajectory with a 

sum over the parhtions and a sum over the trajectory points in each partition 


J=1 

M(r) iV 


/=i j=i 
M(r) 

= [-fV/ 

/=1 

M(r) 

= E 

i=i 


(3 22) 


Comparing eqs (3.12) and (3 22), we have 


Dq = lim 


1 log[C'q(r,d,iV')] 
^-1 logr 


(3 23) 


This formula is easiest to evaluate for ft As seen from eq (3 21), Ct{r,d,N) is an 
arithmetic mean However, this method of evaluating ft does not work very well 

for g < 1 


3.4 Definitions of Dynamical Entropy and 
Relation with Generalized Correlation Sum 

The generalized dimensions of an attractor are static invariants since they only utilize 
the time invariant measure of the attractor However, entropy is a dynamic invariant 
because it quantifies the increase in uncertainty or the toss of information about the 
state of a system with time Metric entropy asymptohcally quantifies the average rate 
of toss of information about the state of a dynamical system as it evolves in time 
The idea of characterizing dynamics by an information rate is due to Kolmog 



74 


3 Dynamical Analysis of Speech Signals — 2 


while Sinai gave a ngorous definihon and proved that it makes sense Thus, metric 
entropy is also known as the Kolmogorov or Kohnogorov-Sinai entropy 

If a dynamical system is observed by making measurements, then at the time 
instant of a measurement we know about the state of the system with a certain 
level of uncertamty depending on the accuracy of the measurement The level ot 
uncertamty about the state of the dynamical system may remain the same upto the 
next measurement, le there is no loss of information of the previous measurement 
as happens for completely predictable trajectories On the other hand, for chaotic 
trajectories, the level of uncertamty of the state of the system increases with time 
(loss of information) till the next measurement which reduces the uncertainty back 
to the onginal level 

The metric entropy K for a dynamical system f as in ecj (3 1) can be defined 
as follows [63]. Consider a trajectory s,, i = 1, of the dynamical system f 

The state of the system is measured at intervals of time T Let the state space be 
partitioned by a finite parhhon Yl = 7rv/(,)] of side resolution r and number 

of parhhons M{r) Also, let P,,, be the joint probability that s{t = T) = Si is in 
some partition Vk = ii, s(f = 2T) = S 2 is in parhtion i 2 , and s{t = nT) = s,, is in 
parhhon tn Then, according to Shannon, the quanhty 

jr(n) = - ^ P.., ,.logF,„ (3 24) 

is the average informahon required to locate the system on a particular trajectory 
with precision r (provided only the apriort probabilihes are known) 

Therefore, K{n-{-l) - K{n) is the additional average information required to predict 
in which parhhon will the system state s„+i be, provided we know 
That is, iiL(n 4-1) - K{n) measures the average loss of information about the system 
from time n to n H- 1 

K is defined as the average rate of loss of information 

1 ” 

K = lim lim Um — ^ + 1) — K{1)) (3 25^) 

T_0r^0n.-oo nT ^ 

= limlimlim -^•IT(n) (3 256) 

T-»0 r-O n-^oo nT 

= lim lim lim [■^^(h -1- 1) — K{n)] 

T -^0 r-^Q n -»oo ± 


(3 250 
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For a more ngorous definition of metric entropy and a discussion on the importance 
ot generating partitions in its computation, see [152], [31] The equality between 
eqs (3 25a) and (3 25b) is easy to establish. The equality between eqs (3 25a) and 
(3 25c) can be arrived at by noting that the former is a limit of the Cesaro sum of 
the sequence obtained from the latter Thus, it can be proved that the form of the 
expression m eq (3 25c) converges faster than the one in eq. (3 25a) [31] 

Metric entropy can be eshmated for both conservahve and dissipative dynamical 
systems For completely predictable systems, K = 0, while for chaotic systems, 
0 < K < oo Ideal random behaviour is characterized by if = oo It is also important 
to note that K is inversely proportional to the average time over which the state of 
a dynamical system can be predicted This is because K quantifies the rate of loss 
of information about a system state Thus, the predictability time Tp is given by the 
proportionality relation [63], [128] 

Tpcx^log- (3 26) 

K r 

where r is the precision with which the inihal state of the system is located 
Beyond Tp, one can only make statistical predictions about the system state Thus, 
T), = oo,0 < Tp < 00 and Tp = 0 for completely predictable, chaotic and ideally 
random behaviours respectively There is also an mteresting relation between the 
metric entropy and the spectrum of Lyapunov exponents given by 

K < ^^{positive At) (3 27) 


The equality which holds almost always for natural measures is known as the Pesin 
identity [36]. 


Just as we have generalized dimensions, there exists a set of infinite order-cf 
Renyi or generalized entropies that characterize the space-time behaviour of dynamical 
systems These are defined as follows [50], [112], [116], [60] 


Kn = - lim lim lim rr 

^ T— 0r-*0n-+oo nj. {q — 1) 

hm K, = K 

- 7-.1 ^ 


io8 E n. 1 ^ » 


(3 28a) 
(3 286) 
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It can be seen that for (^ > q. Just like the correlation dimension D., 

the second order entropy K 2 is singled out because of its ease ot calculation from a 
time series Furthermore, since K 2 < K, we have iiT, > 0 tor chaotic systems 

The method of generalized correlation sum can be extended to estimate the 
order -9 entropies from time senes data Proceeding on similar lines as in the case ot 
generalized dimension, one can easily show that 


Kg = - lim luu 

^ r~^Qn~-*oo 


1 

nT{q - 1) 


log 


a;,«(r,d,Ar) 


(3 29) 


where 
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/ 0 rn 0 


1 1 *m 
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(3 30a) 


or 




"n- 1 d-l 

EE(-..‘.™ 


Li 0 m-o 



(3 306) 


for the Loo and the L 2 distance norms respectively The reconstructed vector time 
senes, given by eq (3 17), is used with time delay k - I tor simplicity ot illustration 
above The summation over m denotes the distance between two points xf and 
on the trajectory whereas the summation over I measures the distance between two 
sequence of points on the trajectory 


3.5 A Unified Approach to the Estimation of the 
Correlation Dimension and Second Order Entropy 
from Time Series 

In this sechon we review a unified estimation procedure for generalized dimensions 
and entropies from a scalar time series observation ot a dynamical system evolution 
As mentioned before, the estimation of the correlation dimension D> and second order 
entropy K 2 from the correlation sum is a relatively easy exercise The importance ot 
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these invariants lies in their immediate relevance in a deterministic nonlinear state 
space modelling effort 

Usmg the state space reconstruction theorems discussed in chapter 2 and recog- 
nizing that Kq IS evaluated in the limit n —* oo, we can replace this limit hy d oo 
and the two summations over m and I for in eqs (3 30a) and (3 30b) by a single 
summation over m Thus [110] 


^ E E ® ('■ - II"? - 


-iq-l 


z=l 


(3 31) 


which is the same as Cq{r,d,N) in eq (3 22) and 

1 


Kn = - Iim lim - 

r— Od— .oo dT(^q — 1) 


log Cq{r,d,N) 


{3 32) 


We now have Cq{r,d,N) ~ 0?^? from eq (3 23) and Cq{r,d,N) ~ from eq 

(3 32) More specifically, observing the corresponding limits, we have 


lim hm — log Cq(r, d, N) = Dq log r - dTKq 

r-^Q d— *00 Q — 1 


(3 33) 


This yields, for g = 2, an expression for obtaining D 2 and Ki from the correlation 
sum C{r,d,N) 

hm hm log C(r, d, N) = D2 log r - dTK2 (3 34) 

r-.0 d-^00 

where 


C(r,d, N) = C 2 ir,d,N) 


= 7^^EEe(-IK-"? 

' J = l t=l 


(3 35) 


and xf is given by eq (3 17) 

The practical implementation should now be clear [59], [24] We should obtain 
plots of log C(r,d,iV) vs logr for increasing values of the embedding dimension d 
The estimation of Dy from 


Dzid) = hm lim 

^ ' r-.0iV-.oo 


logC'(r, d, N) 
logr 


(3 36) 



78 


3 Dynamical Analysis of Speech Signals — 2 


IS inefficient because D 2 converges logarithmically slowly tor r 0. Moreover, the 
correlation sum is dominated by additive noise at the r 0 limit which makes 
reasonable estimation difficult To overcome these problems, the usual procedure is 
to take a local slope at r away from zero or fit a straight line between two r values 
in which the graphs are approximately linear and obtain the slopes tor increasing 
values of d. 


Doid) = 


d log C(r,d, N) 
d log r 

dC(r, d,N)/dr 
C{r,d,N)/r 


(3 37) 


D.{d) = l^sC{n.d.N)- h^C (jjJ. W) J 

logr^-logri 

One can also do a least squares fit between rj and r> The value ot D>{d) should 
converge with increasing values of d, which is then read oft as the correlation 
dimension Do 

Similarly, the quantity 




[ C{r,d,N) ' 
C{r,(i+l,iV) 


(3 39 ) 


converges to K 2 for increasing values of d. Note that in eqs (3 37) and (3 39) we 
have suppressed the notation of the explicit dependence on the resolution i and 
data length N. 

In the next section, we give the dimension estimation results from speech signals 
spoken in the form of phonemes by various speakers and from a simplified statistical 
model for a specific speech signal 


3.6 Correlation Dimension Estimation for 
Speech Time Series 

A dimension analysis of the vocal tract system has to take into account the time 
varying nature of the dynamics It is meaningful to estimate the dimension of a 
dynamical system only when the characteristics of the object (the attractor) do not 
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change in the hme interval under investigation. Because of the physical nature ot 
the vocal tract, we assume that its characteristics remain constant for some time We 
will do the dimension analysis of the vocal tract system based on speech signals 
When we compute the dimension from a block of speech data, we are actually doing 
so for a particular reconstructed trajectory The relationship between a reconstructed 
trajectory and the natural mvanant measure of the attractor of the underlying 
dynamical system is only hypothesised 

3.6.1 Numerical Computation from Speech Time Series 

The speech signals used for the computation of dynamical invariants is as given 
in database 1 m Appendix B It consists of 57 consonants of the IPA spoken by 4 
trained persons (3 males and 1 female) and 8 cardinal vowels spoken 4 times by one 
person (Daniel Jones). To approximate the probability measure by the correlation 
sum m eqs (3 22) and (3 35), it is necessary to approximate the N oo limit, i e , 
use as large a data length as possible Because of the extremely short duration of 
utterances of plosives, they are excluded from this analysis Thus, dimension analysis 
was done on 4x 44 +-4x8;= 208 phonemes as m the case for Lyapunov exponent 
analysis in chapter 2 and second order entropy analysis in section 3 8 The graphs of 
logC(r, d, N) vs logr are obtamed for these phoneme time series The embedding 
dimension d was varied from 1 to 16 in each case and 20 log linear divisions of r 
were used to approximately cover the entire range of distances for the reconstructed 
trajectories La-norm was used to evaluate the distances used m the correlation sum 

Figure 3 1 shows a typical graph of log C{r,d,N) vs logr for increasing values of 
d The slopes D 2 (d) are obtained from a linear Ht in the approximately linear region 
of the plots Figure 3 2 shows the corresponding plot of D^id) vs. d for d = 1, , 16 

For comparison, we have also plotted D 2 (d) vs d for a white Gaussian noise sequence 
of the same length The time series was generated using a computer based random 
number generation algorithm For true random behaviour, D 2 {d) = d Table 3 1 
summarizes the results of dimension analysis performed on the 208 phonemes The 
correlation dimensions reported here are based on the values of the slope at d = 16, 
le , D 2 = L>2(16) The mean value of D 2 over the phoneme set is obtained as 3 74 
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Fig. 3.1: Plots of \ogC{r,d,N) vs logr for (i = 2,4, 

vowel utterance /a/ DR-STV using 400 sample 


, K) and N 4()()0 tor cardinal 
segments = 4 39 dB 



Fig. 3.2: Graphs for the computation of D 2 for (a) cardinal vowel utterance /a/ using the 
\ogC{r,d, N) vs. logr plots of Fig 3 1, and, (b) Gaussian white noise sequence of 
length N = 4000 Part (c) shows the ideal scaling of the slope for true stochastic 
time series 
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The mean dimension values over 4 speakers tor lateral fricatives and tncatives are 
greater than 4 0 and together account for the highest and second highest values 
respectively amongst the 8 families of phoneme types that were analysed From this 
study, we conclude that speech is largely a loiv dimensional signal 

We quahfy the above results further with the tollowing observations. 

1. Data lengths used m the correlation sum: The data length N used in the analysis 
of each of the 32 vowel samples was fixed at 4000 For the 176 consonant samples 
N vanes from 1200 to 4800. The average length over the 208 phonemes is 2986 The 
data lengths were determined by two factors - (a) the duration ot utterance ot each 
phoneme, and, (b) the dynamic range of short time variance (DR-STV) [75] Speech 
signal, which is highly nonstationary has a DR-STV ot approximately 40dB However, 
for the computation of dynamical invariants, the data should come trom a stationary 
statistic (ideally, DR-STV=0dB) We fixed, somewhat arbitrarily, an upper limit tor 
the DR-STV at lOdB for each phoneme time series To compute the DR-STV, the total 
length IS fixed as the length N of the data used in the computation, whereas 25 ms 
(le, 400 samples at 16kHz sampling rate) of speech data is used for each segment 
The mean DR-STV for the 176 consonants analysed is 4 6 ± 2 8dB whereas that of 
the 32 vowels is 2.2 ± 1 9dB 

2. Multiplicity of scales: In some plots of log C(r,d, N) vs logr, two distinct scaling 
regions are observed at different resolutions r For example, see fig 3 3 Although 
dimension is defined in the limit i — > 0, in real world we approximate by finding the 
dimension at a finite value of r In this regard, we quote Mandelbrot [97] "What 
is the dimension of a ball of yarn? From a great distance it is effectively a point 
and appears zero dimensional, on approach it becomes a three dimensional solid, 
moving closer, we discern the one dimensional threads, which then become three 
dimensional again, the threads are again composed ot fibres, etc Ihese different 
scaling regimes would produce rather extreme oscillations in a numerical estimate 
of dimension Typically, when we are computing dimension we are interested in a 
given scaling range, but it may be very difficult to discern " In our case, whenever 
two distinct scales are observed at different resolutions r, we have computed the 
dimension corresponding to the scaling at the lower range of the resolution 
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3. Effect of autocorrelation on dimension estimation: The dimension results of 
Table 3 1 are computed from the modified correlation sum 

C{r,d,N)= + E (3 40) 

where W is chosen as 10 The purpose of the modification (of eq (3 35)) is to exclude 
those pairs of points (xf , xf) from the correlation sum which are closeby in state 
space only because they are close in time It has been shown tor specific cases [134], 
that this modification increases the scalmg region in the log C(r,d, N) vs logr plot 

If, however, we want to retain the effect of small time delay autocorrelation, then 
we can use the unmodified correlation sum of eq (3 35) No significant systematic 
difference in the dimension estimates is observed The mean value ot D> over the 
208 phonemes is obtained as 3 65 in this case 

4. Comparison with filtered white noise: This test consists ot comparing the 
correlation dimension of speech signals with that of linearly filtered white noise 
sequences having the same power spectrum The test data are created by taking 
the Fourier spectrum of each phoneme time series, randomizing the phase and 
inverting the transform We compared the correlation dimension of 13 phonemes (4 
cardinal vowels, 1 nasal, 5 fricatives, 1 lateral fricative, 1 approximant and 1 lateral 
approximant) with that of linearly filtered white noise sequences having the same 
magnitude spectrum The corresponding mean values of the correlation dimension 
are 3 32 ± 1 00 and 4 41 ± 1 04 respectively Thus, there is a systematic increase in 
the dimension by 1 09 While it is possible to differentiate speech signals from white 
noise using spectral analysis and also dimension analysis (see fig 3 2), no significant 
difference in the dimension is observed between speech signals and linearly hltered 
white noise having the same magnitude spectrum This suggests that the low 
dimensionality of speech data is largely due to the second order characteristics ot 
the data 

It has been observed that random noise sequence with power law spectra 
can exhibit finite correlation dimension when n > 1, [105] Ideally, white noise 



3 Dynamical Analysis of Speech Signals — 2 


85 


(coming from a stationary distribution) is characterized by infinite dimension This 
contradiction is resolved by observmg that in the former case there is no underlying 
invariant measure [61] Therefore, one can only talk of finite dimension of a realization 
of random noise sequence with power spectra, and not that of any underlying 
attractor or invariant measure Similarly, in light of the above result for speech 
signals, it IS more appropriate to talk of the low dimensionality of speech signals 
and their corresponding reconstructed trajectones rather than linking them to any 
hypothesised attractor or invariant measure of the approximately stationary segments 
analysed 

5 Comparison with other results: Preliminary results of the estimation of dynamical 
invariants were incorporated by us in [90], [92], [91] Other researchers have 
corroborated the result that speech is largely a low dimensional signal Dimension 
values between 3 and 5 for voiced speech are reported for a number of speakers 
in [140] Similarly, a correlation dimension of 3 3 over 30s of continuous speech 
IS reported in [143], [144] In another study, [14], the correlation dimension from 
approximately Is of vowel utterances [a , i , u ] each by 3 male speakers were 
obtained between 1 2 and 1 9 


3.6.2 Correlation Sum and Dimension from a 
simplified Statistical Model of Speech 


In this subsection, we consider the estimation of the correlation sum and the com- 
putation of the correlation dimension from a statistical model of a vowel utterance 
The purpose of this exercise is to compare the result with the numerically eshmated 
value of dimension for the same phoneme utterance The correlation sum given by 
eq (3 35) can be alternatively written as 


C{r,d,N) = 


2 

N{N-l) 


N N-n 

EEe(^- 


n=l 1=1 



(3 41) 


where 0( ) is the Heaviside function and xf is given by eq (317) Assuming that 
the scalar observable Xj, i = 1, , — 1, (i e , the speech time series) of the vocal 

tract system is a realization of a stationary random process, we can write 

C(’'. -i. -W) = E^'^ - “ ’^"11 - ’■) f'’ 
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for some i (say, t = 1). Thus, we require the joint probability density function of the 
2d random variables xi, , aJnJ-ii i ®n+i4-((i-i)A: where k is the integer time 

delay 

Various probability density functions (p d f 's) for speech signals have been pro- 
posed based on experiments Amongst them are a specialized form of the gamma 
density and the Laplacian density functions [75], [100] Short time p d f 's of speech 
segments are better described by the Gaussian p d f irrespective of whether the 
speech segment is voiced or unvoiced [75] We are here interested in the analysis of 
a specific phoneme which was chosen as the cardinal vowel /i/ This is because ot the 
relative ease m formulating a statistical model for this phoneme utterance A joint 
Gaussian density function is chosen for the corresponding time series ri, ; = 1, , iV' 

{N' = N + d-l) Its autocorrelation function (ACF) is approximated by 

(3 43) 

(0 n > 

where nx = 85,a = = 1500 Figure 3 4 shows the estimated ACF from a 

realization of the vowel utterance of length N' = 4000 and its approximation by 
eq (3 43) The covariance matrix can be found from eq (3 43), and without loss of 
generahty, the mean vector for the 2d variables can be chosen as 0 Thus, the joint 
density function is completely specified. 

Because of the relationship between the 2d random variables ot x'/ and xf „ in 
eq (342), we can simplify to d random variables , ;/,i through 

j rp 

Yn = [yi- y2, , vd] 

yi = ®n+l+(i-l)fc ~ 3^1+(1-I)*:i ^ = 1, ,d (3 44) 


It is easy to show that will also have a joint Gaussian density function 


p(yn) = 


(27r)5|Aj 


exp 


■2(yn)" c*, 


Yi 


iyi) 


(3 45) 


where Cyj is the d x d covariance matrix of yf, and Ayd is its determinant The ij'^ 
element of Cyd is given by 


C\j [(®n+l-f-(i-l)fc — 1)1,)] 

= 2ACF [(t - j)k)] - ACF [n + (i - j)A.| - ACF [n f - i)^) (8 46) 
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For simplicity of illustration, we use the Loo-norm for the metric in eq (3 42) Ideally, 
the choice of metric does not affect the dynamical invariants Thus, 

^ (||xf+n - xf II <r)=F |y/| < (3 47) 

For iV oo and the approximation ^ 0, we have 


C(r,d) = lim C{r,d,N) 

/V-> 0 O 


— — hm 

N(N-l) N-^oo 


f^ctU 

y;(jv - n)p 

-n=l 


( max Ivil ^ 
\i<i<d ~ 


+ ^ (iV -n)P( max l 2 /; I < r) 

l<Kd 


N(N- 1) N 


lim 


HsjiL. 

N 


(N- n)P(m^ly,| < r) 


= P 


max \yi\ < r 
\i<i<d 


(3 48) 


where 


P 


( max lyd < r 
\l<l<d 


f f p ^y<^ 

Jyd=-r Jyi = -r 


(3 49) 


Note that for n > p(y^) is rio longer a funchon of the delay n, since there is no 

signal correlation for n > Ucut and we assume a Gaussian density function 


An analyhcal computation of the correlation dimension from eqs (3 48) and 
(3.49) at resolution r -> 0 will give I> 2 (d) = d However, we are here interested in 
the computation of D 2 {d) at a finite range of the resoluhon r which is relevant in a 
prachcal situation To obtam a graph of logC(r,d,iV) vs logr we evaluate eq (3 49) 
at different values of r for increasing embedding dimension d Figure 3 5 shows the 
plots for d = 1, ,7 The scaling exponent at d = 7 is obtained as 4 07 using a linear 

fit. For comparison, the scaling component at d = 7 obtained from a time series 
realization of the phoneme /if of length N = 4000 is 2 95 ± 0 17 We conjecture that 
the difference in the two values can be partly attributed to the finite data length in 
the latter case However, this does not affect the nature of the conclusion drawn from 
the dimension results of section 3 61 viz speech signal is largely low dimensional 
in the resolution scale range in which the analysis was performed 
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Delay 


Fig. 3.4: The autocorrelation function for cardinal vowel utterance /i/ I he graph shows 
the time series estimate from N = 4000 samples (solici line) and its approximation 
by eq (3 43) (dashed Ime) 



log r 


Fig. 3.5: Plots of \ogC{r,d,N) vs logr for d = 2,3, ,7 from a numerical simulation of 

the theoretical eshmate of the correlation sum in eqs (3 48) and (3 49) 
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In the next section we make some remarks about the limiting regions of the 
log C(r,d,JV) vs logr plots from which the scaling exponents are computed We 
will also discuss about the various sources of error in dimension eshmation. 

3.7 Implementation Aspects of the 
Correlation Algorithm 

More than one researcher has commented on the dimension estimation problem 
as being more of an art than a science In this regard, Theiler has remarked 
that experience indicates the need for more experience [137] The reason for such 
remarks is that the estimation of dimension from experimental time series data 
involves approximations and balances which are often qualitatively arrived at A 
large amount of research effort has been expended to make the estimation procedure 
more precise and to study mathematically the vanous sources of estimation error 
Because of finite data length and the presence of low amplitude additive noise, 
dimension estimation is done at an interpoint distance r away from 0 The procedure 
to do this IS given in sechon 3 5 and in the following subsections we discuss the 
above mentioned issues further 

3.7.1 Practical Remarks 

A major advantage of the correlahon algorithm over box counting algorithms for 
dimension estimation is that in the former case, one can probe down to distance 
scales of 0{N'~^) le, upto the smallest interpoint distance, whereas in the latter, 
one can only probe upto le, upto the average nearest neighbour distance, 

for data length N and embedding dimension d Another factor which restricts the 
scale length le the extent of the abscissa in the \ogC{r,d, N) vs logr plots is the 
quantization of the data For 12-bit data, as in our case, the extent of the interpoint 
distance is over 13 octaves On the ordinate axis, the dynamic range of the correlahon 
sum varies from 0(1) downto O(^) (For example, log 2 (]|i) = -23, for N = 4096, 
see fig 3 1) To facilitate the estimahon of correlation dimension, it is desirable to 
riii/irio InorOfr d N) vs logr plots into 4 possibly overlapping regions 
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1 At length scales of the order of the smallest interpomt distance the correlation 
sum scales as a collechon of isolated points and hence is not suitable for dimension 
estimahon 

2 At small interpomt distance scales, the correlation sum may be inaccurate due to 
the presence of low amplitude additive noise in the data If the noise s d is o-, then 
the correlation sum at length scales below a is expected to scale as the corresponding 
embeddmg dimension d [102] 

3 Length scales above a are ideal for eshmating Dz 

4 At length scales approachmg the diameter of the attractor/state space extent of 
the trajectory, the correlahon sum saturates This region again cannot be used for 
reliable dimension eshmation 

Another important practical issue is the efficient computation ot the correlation 
sum A naive implementation would require the computation ot OiN’) interpomt 
distances for each embedding dimension d considered, using data ot length N 
Apart from eliminating the obvious redundant computations due to the nature of 
reconstruction vectors, eq (3 17), several algorithms have been proposed to reduce 
the computation effort A method to compute only the (){N) shortest distances 
with 0{N log N) effort is given in [135] Other algorithms use fast neighbour search 
routmes [56] and multidimensional (k-d) trees [15] for this purpose 

One popular method for reducing the computational complexity of the correlation 
sum IS to choose few reference points = x^, j = 1, , A^r^/, € (T N) and use 


where xf is given by eq (3.17) and d is the embedding dimension Compare this 
with eq (3 35) It is, however, shown in [138] that such a scheme tor i educing the 
computahonal complexity is not desirable because it decreases the precision ot the 
estimate To optimize the statishcal error with the computation time, it is necessary 
to take Nref of the order of N, le, the entire data length available 


(3 50 ) 


C{r,d,N,Nref) = 


1 


Nr,fN 


Nrd N 

EE® 

j -1 t 1 
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In the following subsection, we qualify the problem of estimating the correlation 
sum and dimension with a discussion of the various sources of error affecting their 
accuracy and precision. This is done m the next subsechon 

3.7.2 Sources of Error in Estimation 

The various sources of error may affect the eshmates in two fundamental ways - 
(a) It may cause the eshmate to be biased This is variously referred to in dimension 
literature as systematic error or accuracy of the estimate There are wide variety of 
sources leadmg to this kind of error such as lacunarity, prefiltering, quantization, 
additive noise, edge effects etc. (b) The statistical preasion or the standard deviation 
of the estimate from its mean value This arises from the finiteness of the data length 
available. The two kinds of errors usually work at cross purposes and they have to 
be balanced for optimal estimates 

1. Lacunanty: The underlying assumption m estimatmg D 2 from log ^(r, <i, iV) vs 
logr plots IS that the correlahon sum scales as a power of the interpoint distance 
The failure of this scaling to hold exactly is referred to as lacunarity This can be 
explicitly incorporated through a model L{r) in the correlation sum as 

C(r,d,iV) =L(r) (3 51) 

The middle-thirds Cantor set is a simple example which shows periodic lacunarity, 
le, L{r) = L{kr), k = ^, and for all r Lacunarity has the effect of introducing 
intrinsic oscillations in the correlation sum If Dt is estimated from a local slope, for 
example using eq (3 37), then a large systematic error may occur. Usmg eq (3 38) 
has the effect of reducing this systematic error particularly if several periods of the 
intrinsic oscillations are included in the scaling region (ri,r 2 ) The reader is referred 
to [136] and its references for further details 

2. Prefiltering: The phonemes of speech database 1 (Appendix B) were lowpass 
filtered at 7 5 kHz and sampled at 16 kHz The purpose of prefiltering is to attenuate 
high frequency noise in light of the fact that speech signals have negligible spectral 
content above this frequency level However, the process of filtering may lead to 
a systematic increase in the dimension estimate The process of filtering introduces 
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additional Lyapunov exponents) for the hme senes For example, an ideal lowpass 
Biter with cutoff n introduces a new Lyapunov exponent L; = -p, the other exponents 
remaining unaffected [6] Consequently, the Lyapunov dimension, given by eq (3 16), 
remains unchanged as long as p > |A,„| Otherwise, an overestimate results 

We have not computed the complete Lyapunov spectrum for the phonemes of 
the speech database However, the largest Lyapunov exponent for the 208 phonemes 
has been computed and the results summarized in chapter 2 The mean value is 
obtamed as 2899s-‘ In section 3.8, we present the results of second order entropy 
computation The mean value of K, is obtained as 8613 9s > Using these results 
along with Pesm's identity, eq (3 27), leads us to conjecture that the lowpass filtenng 
at 7 5 kHz does not introduce a bias in the dimension estimates 

The effect of convoluhon of a desired signal with a linear time invariant (LTI) 
system on the fractal dimension of an attractor is studied in [72] A sufficient 
condition is developed for the impulse response of the L'll system for which the 
dimension estimate is not affected Specifically, if the log magnitude of the largest 
pole radius of the LTI system is less than the smallest Lyapunov exponent, there will 
be no change in the dimension estimate due to the filtering Unfortunately, since this 
sufficiency test is m terms of the smallest Lyapunov exponent of a system, which is 
difficult to estimate accurately from time series observation of the dynamical system, 
this result is of little practical use 

3. Quantization: This causes a clustering of points of the reconstructed trajectory 
on the vertices of a hypercube in the embedded space, where the side of the cube 
is equal to the least significant digit due to quantization Heunstically, this should 
lead to a systematic underestimate of the dimension because the clustering of the 
trajectory to a finite number of points reduces its complexity Upto first order, the 
estimate D 2 of D 2 can be effectively modelled by 

D, = D2{1-^] P52) 

r 

where if is a positive factor of order unity, p is one half of the least significant 
digit and r = (rir2)3 (see eq (3 38)) [102] Assuming 12-bit quantization as in the 
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case of speech data, ri = % and rs = at d = 16, we have = -0 14% For 

the estimated mean value of Bo = 3 74 for the speech data base, this represents a 
negligible systematic underestimate by 0 005 A standard procedure for reducing this 
systematic error is to i^e dithering 

4 Additive noise: If the s d of noise is comparable to the length scale used for 
computing B 2 / then the estimate Z >2 continues to increase as d is increased, resulting 
in an overestimate and the appearance of approximate convergence. It is demonstrated 
m [102] that such an overestimate due to additive random noise occurs for all values 
of a An effective model for Do is given by 




(3 53) 


where K is a positive factor of order unity, a is the sd of addihve noise and 
f = (ri ry)^ For 12 bit data, assummg ri = |^, r 2 = and cr to be equivalent to the 
lower 6 bits at d = 16, le., a = 2®, we have = 3 1%. From this, the estimated 

mean value of Do = 3 74 for speech database 1, represents a systematic overestimate 
of 0 11 for the above parameters 

5. Autocorrelated data: Autocorrelation of the hme series may result in an anomaly 
in the correlation sum This issue and its remedy are discussed in section 3 61 

6. Statistical Error and Data Length: We have so far considered sources of error 
which lead to a bias in the dimension eshmate The statistical error or precision is 
solely an artifact of the finite data length used for obtaining the estimates The issue 
of statistical error in the estimation of correlation sum is addressed in [138] It is 
shown that the statistical error of the estimate scales genencally as 0{^) as iV — + oo 
There are exceptions which show O(^) scaling. 

The problem of minimum data length requirement for reliable dimension es- 
timates has also been widely investigated and various bounds reported To begin 
with, the correlation sum vanes from ^ to 1 for a data length of N samples If 
R = ^ IS the ratio of the interpoint distances over which the dimension is estimated, 
then 
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IS an absolute lower bound for N for a d-dimensional trajectory/attractor [137], [138] 
By appealing to the edge effect at large scales and nearest neighbour separation 
at small scales, it is argued in [130] that to obtain an estimate within 5% ot its true 
value, N is bounded by 

Mnm > 42-^ (3 55) 


However this exceedingly high data requirement has been variously rejected (see, 
for example [61]) because of the wrong assumption that the number ot neighbours 
required to resolve the correlation sum down to a specified interpoint distance r 
is the same as the density of points of the reconstructed trajectory m a hypercube 
of side resolution r m the appropriate embedding space This is only true for box 
counting algorithms such as those used for estimating the fractal dimension where 
the data requirements are much higher 

In another study, [104], the incorrect assumption ot the above bound is noted 
They use the combined influence of the edge effect at large scales and characteristic 
spacing between neighbours to arrive at a moderate bound 


,, ^ [2r(d/2 + i)]^ 

^^min rf+2 

(Alni?)“ 


2(/i:- i)_r((d + 4)/2) 
■(rij- r((d + 3)/2) _ 


ii 2 


(3 56) 


Here, d and R are the same as in eq (3 54) and A is the maximum error allowed in 
the estimate 

In a paper due to Ruelle [118], a fairly optimistic bound on the required data 
length N in terms of the correlation dimension d is given as 


Nrmn > lO' 


(3 57) 


Compare this with eq (3 54) 

As an example, for an estimate of the maximum correlation dimension d - 5 4, 
Nmin should be greater than 388, 2519 and 501 according to ecjs (3 54), (3 56) and 
(357) respectively using i? = 8 and A = 0 Id (i e , allowing a 10% error in the estimate 
usmg eq (3 56)). By comparison, the mean data length over 208 phonemes used for 
dimension estimation is 2986 It is yet not clear as to what constitutes a good criterion 
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for obtsinin^ nfiiniiniiiTi cist^ length roc^uirGmcnts TKg differing rcc^mrcrtionts for 
Nmm obtained with various criteria point to this 

3.8 Second Order Dynamical Entropy for 
Speech Time Series 

The speech database used for the computation of second order entropy K 2 is the 
same as used in the computation of correlation dimension and largest Lyapunov 
exponent. The metric entropy K is an upper bound of K 2 although the difference 
between the two values is usually negligible One generally computes K 2 instead of 
K because it is relatively easy to do so from the correlation sum Figure 3 6 shows 
a typical plot of K 2 {d) vs d for d = 1, , 15, eq (3 39) This plot is for the cardinal 

vowel utterance /a/ and is obtained from the logC(r,d,Ar) vs logr graph shown in 
hg 3 1 for the corresponding time series Table 3 2 summanzes the second order 
entropy analysis results for the 208 phoneme based hme senes of speech database 
1 These are based on the values of K 2 {d) at d = 15 The mean value of K 2 over 
208 phonemes is obtained as 8614s“‘ at 16 kHz sampling rate As m the case of 
correlation dimension, the mean value of K 2 over 4 speakers is the largest for lateral 
fricahves and the second largest for fricatives There is no further general trend to 
classify phonemes based on the second order/metric entropy However, for a state 
space model of the form of eq (3 1) and associated mitial condition (1 c ) sq of the 
vocal tract dynamics or the speech signal, the mean value of K 2 gives an estimate of 
Its predictability time Tp Thus, according to eq. (3 26), for an 1 c of 12 bit precision, 
Tp ~ 1 4 ms In chapter 5, we will check the veracity of this result by prediction 
using a model which assumes very little about the data charactenstics Alternatively, 
metric or second order entropy gives the average amount of information that should 
be optimally provided to a dynamical system at every iteration to maintain the level 
of precision of the trajectory at that of the initial condition 

Pesin's identity (eq (3 27)), which relates the metric entropy to the sum of 
positive Lyapunov exponents of a dynamical system, allows us to compare the 
respechve results for the case of speech signals The equality, which is expected to 
hold, IS not violated as can be seen by comparing the second order entropy results 
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16000 

12000 
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0 

Embedding Dimension 

Fig. 3.6: Graph for the computation of the second order entropy tor cardinal vowel 
utterance /a/ using the logC(r, d, A^) vs logr plots ot I ig 3 1 
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3 Dynamics! Analysis of Speech Signals~~2 

in Table 3 2 and the mean largest Lyapunov exponent values in Table 21 
positive values of the largest Lyapunov exponent and second order entropy both 
give evidence of the average divergence of nearby trajectories of speech signals m 
the reconstructed state space This significant observation allows us to distinguish 
speech signals from strictly periodic or quasipenodic behaviour 



Chapter 4 

Polynomial Prediction of Speech 


The dynamical analysis results of chapters 2 and 3 will form the basis of our 
investigation of some nonlinear prediction schemes for speech modelling and coding 
in a state space framework. The observation that speech is largely a low dimensional 
signal suggests that few state space variables will be needed to model it When 
a time series evolution is under observation, information is acquired at a rate of 
/„ = -/, log 2 r, where r is the resolution of the measurement and /, is the sampling 
frequency However, the rate at which the underlying dynamics produces information 
IS given by the metric entropy For speech database 1 (Appendix B), the second 
order entropy results of chapter 3 show that while informahon is acquired at a rate 
of 192 kb/s (12 bits/sample at 16 kHz), the information produchon rate is only of the 
order of 9 kb/s Over and above, if we allow for lossy coding of speech, which is 
normally the case, the required rate of transmission can be brought down further by 
a significant amount. (Note that the above figure of 9 kb/s is for 16 kHz sampling 
rate and not the usual 8 kHz for telephone grade speech ) Also, if the estimahon 
results of chapter 6, of the lower bound of the rate distortion function for speech 
sources assuming memory, are any indicahon of the possible limits to coding, then 
present day speech coders are still far off from them 

Most present day speech coders in the low and medium bit rates, which repro- 
duce natural sounding speech, are linear prediction based analysis - by - synthesis 
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coders Linear prediction has served speech coding purposes very well for decades 
and continues to do so The reasons for this are not difficult to find The theory of 
linear filtering is well developed, the superposition principle which is the cornerstone 
of linearity is often used to simplify coding complexity etc and the model ot a 
source exciting a time varying linear filter can be seen to approximate the human 
speech production mechanism at certain levels 

There have been isolated attempts in the recent past in the study of nonlinear 
representational forms for speech coding We have recorded them in the historical 
note of section 1 6 Most of the recent advances in speech coding, however, have 
come through manipulations of the excitation function rather than changes in the 
model form itself 


We will study some nonlinear representation forms tor predictive modelling 
of speech in a state space framework in chapters 4 and 5 I his is a deterministic 
modelling framework as one for waveform coding must necessarily be I he appeal 
of such a framework lies in the mtuitive idea of the possibility of capturing apparent 
random or chaotic behaviour m compact model forms That this inverse problem 
is indeed addressable has been shown for specific examples in [33] There are 
two basic schemes for predictive state space modelling These are the global and 
local prediction schemes In a global prediction scheme, the function parameters 






are optimized over the entire state 

funchon is optimized over a local volume m stale space where the prediction is lo 
be done We consider these two basic approaches ,n chapters 4 and 5 respectively 

The organization of this chapter is as follows In section 4 1, we review the 
salient features of the analysis - by - synthesis class ot linear predictive coders 
We also discuss the basic structure and analysis steps ot a Ch.l.P coding scheme 
Some model based analysis pointers to the presence of nonlmeanties in the speech 
signal are given in section 4 2 Section 43 is concerned with the development ot 
the analysis steps of polynomial prediction modelling ot time series data A stale 
space formulation of the modelling problem is also given in this section In seclion 
4, we give results of our experiments on polynomial prediction ot speech and Us 
comparison with the traditional linear prediction and make some observations 
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4.1 Analysis - by - Synthesis Linear Prediction Coders 

In this sechon, we describe the salient features of the analysis - by - synthesis 
class of linear prediction coders. Thereafter, we will study the code excited linear 
prediction (CELP) coder, which belongs to this class of coders Some of the early 
coders belonging to the analysis - by - synthesis class were the multipulse linear 
prediction coder (MPLPC) [4], the regular pulse Imear prediction coder (RPLPC) [88] 
and the CELP coder [123] 

The basic structure of an analysis - by - synthesis linear predichon coder is shown 
in fig 41 The name analysis - by - synthesis obtains from the fact that the coding 
parameters for successive blocks of speech are obtained by actually synthesising the 
reproduced speech in the coder itself Thus, the function of the decoder is emulated 
in the synthesis block of the coder also which is used to determine the optimal set of 
parameters meant for transmission The filter models the short term correlations 
in the speech signal, i e , its spectral envelope It has the form 

—L. = i (4 1) 

where, Ut, i = 1, ,p are the short term predictor coefficients and p is the order 
of the filter The typical values of p range from 8 to 16 assummg 8 kHz sampling 
rate, and the coefficients are determined from the speech signal in an open loop 
manner using linear prediction (LP) techniques [114]. Usually, the autocorrelation or 
stabilized covariance method [5] is used to determine the LP coefficients The filter 
coefficients themselves are computed from speech frames of 10 to 30 ms duration 
and adapted every frame For purposes of coding, the LP coefficients are normally 
converted to line spectral pairs (LSP), also known as line spectral frequencies (LSF), and 
quantized [131]. 

The filter ^ models the long term correlations in the speech signal correspond- 
ing to Its spectral fine structure and has the general form 

^ ^ \ (4 2) 

B{z) 

where i = —q, , r are the (^ +• r -f- 1) long term prediction coefficients and D is 

usually the pitch delay The number of coefficients typically varies from 1 (r = ^ — 0) 
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to 3 (r = ? — 1) The delay D and coefficients b^, i — —q, ,r are determined either 
from the speech signal or from the residual signal after removmg the short term 
correlation as shown in the analysis block of fig 41 [115] The filter coefficients are 
usually adapted faster than the short term prediction coefficients at durations of 5 to 
15 ms For a sampling rate of 8 kHz, a suitable range of the delay D is between 16 
and 143 which corresponds to a pitch range between 56 and 500 Hz. The integral 
delay D can then be coded using 7 bits. The filter coefficients are generally scalar 
quantized for coding 

The long term prediction filter is not always explicitly used in the coder. For ex- 
ample, in RPLPC and MPLPC, the excitation functions are often capable of modelling 
the pitch period redundancy In earlier versions of the CELP coder, a long term 
prediction filter was considered necessary However, nowadays its function is usually 
taken care of by an excitation function chosen from a so called adaptive codebook [81] 
This allows for an optimal closed loop search of the appropriate excitation function 
with relatively less complexity and also gives scope for mcorporating fractional pitch 
delays [85] which improves the naturalness of reproduced speech. 

The filter excitation function representmg the coded residual can be chosen in 
a variety of ways Indeed a major portion of the research effort in the analysis - 
by - synthesis class of speech coders is given to the design of appropriate forms of 
excitation functions to serve various purposes In MPLPC, the residual is coded as a 
sequence of pulses located at nonuniformly spaced intervals The excitation analysis 
procedure has to determine both the amplitudes and positions of the pulses In 
RPLPC, the residual is coded as a set of uniformly spaced pulses The position of the 
first pulse within the excitation frame and the amplitude of the pulses are determined 
in the analysis procedure In a CELP coder, the excitation vector is chosen from a 
"shape" codebook and a scalar gam factor is chosen from a "gam" codebook in such 
a manner that a weighted distance between the original and reconstructed signal is 
minimized We will discuss the structure and operation of a CELP coder in more 
detail in section 4 11 

The coder structure of fig 4 1 minimizes a weighted mean square error between 
the original and reproduced speech signals for each frame corresponding to the 
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durahon of the excitation vector The weighting filter W{z) incorporates a model of 
auditory perception based on the phenomenon of masking Its function is to reduce 
the perceived noise in the reproduced speech The effect of masking is to reduce the 
perception of one sound in the presence of another The human auditory mechanism 
has only a limited ability to detect small errors in the frequency bands where the 
speech signal has high energy (for example, in the formant regions) By minimizing 
the perceptually weighted mean square error, one makes use of the masking effect, 
m that the quantization noise is redistributed in relation to the speech power over 
different frequency bands A suitable weighting filter is given by, [87], [5] 


W{z) = 


Aizjy) 


1 - Ef 1 

1 ~ El '' 


0 ' •> ' 1 


(4 >1) 


where A{z) is given by eq (4 1) and the parameter i lies between 0 and 1 and 
controls the degree by which one wants to deemphasize the toimant regions in the 
quantization error spectrum 

For a more detailed review and a unified presentation of the basic coder forms 
of the analysis - by - synthesis class, the reader is reterreci to (87) 


4.1.1 Structure and Analysis of a CELP Coder 

In the analysis - by - synthesis class of linear prediction coders, the code excited 
linear prediction (CELP) coder is the most promising one in terms of pertormance 
parameters Consequently, significant attention has been given in the recent past to 
improve the CELP coder structure For a recent exhaustive survey ot the advances 
made m this direction, see [51] In the tollowmg, we limit ouiselves to a brief 
development of the essential features that constitute a unit analysis tiame ot a ChLP 
coder 

Figure 4 2 shows a structural block diagram of the CFLP coder I his is based on 
an alternahve representation of the analysis - by - synthesis class ot linear prediction 
coders compared to fig 41, and which is also frequently used (see, eg [86], [87]) 
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The short term linear prediction analysis is done over one analysis frame of 
speech data Let us denote the sequence of data in an analysis frame by Xn, n = 
0, 1, ,Nf - 1 The following analysis, however, is done over one excitation frame 
and IS repetitive for each such frame Let us denote the sequence of data in an 
excitation frame by Xn, n = 0, 1, ,Nf -1. The analysis frame length is chosen as 
an integral mulhple of the excitation frame length Although the LP analysis is done 
over an analysis frame of speech data, the coefficients may be interpolated for each 
excitation frame in the LSP domain The LP analysis determines the coefficients of 
the two filters in the coder The function of the long term prediction filter, as shown 
in fig 4 1, IS replaced by an excitation function chosen from an adaptive codebook 
and appropriately gain multiphed 

For each excitation frame of speech data, Xn, u = 0, 1, ,Ne — l, the perceptually 
weighted speech signal a:“, n = 0, 1, ,iVe - 1 is obtained from 

x':: = Xr,*hZ. n = 0,l, ,iVe-l (44) 


where, /i“ is the truncated impulse response of W(z) For obtainmg the synthesised 
speech, the two filters H[z) = ^ and W{z) = are combined into one filter 

given by 




A{zli) 


(4 5) 


The reproduced speech n = 0,1, can be considered as a sum of the 

zero input response, Zn, n = 0, 1, ,iVe — 1 and the zero state response, y„, n = 
0, 1, of the H'^{z) filter correspondmg to its excitation Thus, 


— Zn yni 


n = 0, 1, , Ne - 1 


(4 6) 


The analysis - by - synthesis procedure must determine for the excitation frame, 
two indices corresponding to the excitation vectors from the adaptive and stochastic 
codebooks and two gain indices for the respective vectors by minimizing 


n=0 


(4 7) 



analysis 



I 


Fig 4 2 Structuie ot a 
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where 


= « - ^n) - 

= n=(), 1, ,N,-l (48) 

For rrotahonal convenience, we use vector representation tor sequences of length 
etc Thus, eq (4 7) can alternatively be written as 

E = (4 9) 

where 

r.v, il^ (4 10) 

We will henceforth drop the superscript since all the vectors are ot length N, 

The minimization with respect to the adaptive codebook index and gam and 
the stochastic codebook index and gam is done sequentially m two steps In the 
first step of the mmimization procedure, the indices corresponding to the adaptive 
codebook and its associated gam codebook are found In the second step, those of 
the stochastic codebook and its associated gam codebook are determined In the 
following, we discuss the two steps m the minimization procedure 

STEP 1 - Adaptive Codebook Stage: The filter output due to the excitation from 
the adaptive codebook attempts to model the long term correlation structure ot the 
speech signal The name "adaptive" codebook derives because it is updated after 
each excitafion frame of speech is processed This way, the codebook tracks the 
varying long term correlation structure of speech Since the codebook update is 
based exclusively on the filter excitation vector of the previous frame which is also 
available to the decoder, no extra bits of information have to be transmitted to update 
the codebook on the decoder side [85] 

Let be a candidate vector given by 

J = l, ,N., 


(Ill) 
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where is an excitahon vector from the current adaptive codebook, is the 
corresponding optimal gam factor derived below and is the size of the adaptive 
codebook In the first stage of minimization. 


n = 0,l, ,iVe-l 


(4 12a) 


e(o) = x“' - z 


(4 12b} 


where the vectors comprise of appropriate scalar entries accordmg to eq. (4 10) 


Also, 


t = l, ,Na 


(4 13) 


IS the scaled filtered codeword where 


ho 0 0 

hi ho 0 

■ hN^-l hN,-2 ^iVe-3 


(4 14) 


IS the N, X TV, lower tnangular matrix whose columns comprise the truncated impulse 
response of the ^ filter The corresponding square error is given by 




(4 15a) 


where 


e(‘) = - y« 


(4 15b) 


The optimal gam factor is obtained by minimizing with respect to gi'’ and is 


given by 




(4 16) 


The problem then consists of finding the adaptive codebook index out of i - 


1, , Na which minimizes 


B(.) = _2j<0e<")'H.xW + (j<'>)' x<‘>'hIH„xW 


(4 17) 
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where is the quantized value ot obtained trom a corresponding scalar code- 
book 

STEP 2 - Fixed/Stochastic Codebook Stage: The name "fixed" codebook is to contrast 
it with the adaptive codebook since this is a prefixed codebook It is also referred 
to as a "stochastic" codebook because its candidate sequences are generally random 
in nature 

Let u = be the first stage excitation vector The zero state response ot 

H'^{z) due to this input is given by 

yi = H,„u (4 18) 

Step 2 IS similar to step 1 except in the codebook and the target vector e'*’) Here, 

= x’" - z 11 , (4 19) 
yW = H,„vW, j 1, (4 20) 

where Nc is the size of the stochastic codebook and 


v(0 = (4 21) 

tW IS an excitation vector from the codebook and g[''> is the corresponding optimal 
gain term obtained by minimizing 


i ?(0 = 


(4 22) 


where 


and and yfi) 
optimal value of 


e(') = - y^‘) 

are given by eqs (4 19) and (4 20) respectively in this step 

(0 , 

Qa IS given by 


(4 23) 
The 





The problem in this step then is to find the codebook index i, out ot i - 1, 
which minimizes 


(4 24) 
,N„ 


(4 25) 


S<*) = + (<;<•)) 
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Here is the quantized value of \ (eq (4 24)), obtained from a corresponding 
scalar codebook 

This completes the discussion of the analysis - by - synthesis steps for an 
excitation frame of speech data Figure 4 3 shows a block diagram representation 
of a CELP decoder. The information that is transmitted to the decoder once every 
excitahon frame duration includes the adaptive codebook index, ta, the corresponding 
gam mdex, the stochastic codebook index, i„ and the corresponding gam index In 
addition, the quantized filter parameters are transmitted once every analysis frame. 

Before closing this secton, we note that the above CELP coding structure is only 
one of the common variants of the basic form. We have implemented a CELP coding 
scheme based on this structure for the purpose of comparison of the reproduced 
speech with that of a coding scheme to be proposed m chapter 5 

4.2 Model based Indicators of Nonlinearities 
in Speech 

In chapter 1, we made a general case for the study of nonlinear modelhng schemes 
for speech based on certam observations of the speech production mechanism and 
the speech signal itself, hmitations of a Imear modelling scheme and recent advances 
made in the theory and practice of nonlmear processing methods In continuation 
of that theme, we present some experimental results based on speech models which 
give further indication of the need to study the performance of nonlinear models 
for speech This section can be looked upon as an interface between the nonlinear 
dynamical analysis work of chapters 2 and 3 and the nonlmear speech modelling 
and coding study to follow in the forthcoming sections and in chapter 5. 

A. Dynamical Analysis of MPLPC Error Sequence 

Some results of a correlation dimension analysis of the LP residual sequence are 
reported in [143], [144] In these studies, 10 ms segments of speech are used to 
adapt LP filter coefficients (model order = 12) and compute the dimension for 30s of 
the residual sequence obtained from successive speech segments It is reported that 
the scaling behaviour in the \ogC{r,d,N) vs logr plot (see, for example, sections 
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3 5 and 3 6) is less clear for the residual sequence compared to that ot the speech 
signal Also the saturation of the correlation dimension is at a significantly higher 
value compared to that for speech signals This is plausible because the analysed 
residual sequence is a piccczoise /mcar (effectively nonlinear) transformation of the 
speech signal, and their corresponding correlation dimensions need not be simply 
related However, if there is a deterministic explanation for speech signals, then 
Its correlation dimension is theoretically the same as that ot the residual sequence 
obtained by fitting a linear time senes model (fixed coefficients) with finite number of 
lags [19] This suggests that it is more appropriate to find the correlation dimension 
of the residual sequence obtained from individual speech segments and compare 
rt with that of the corresponding speech signal We will give some results of the 
dynamical analysis of reconstruction error sequences ot an MPLPC scheme 


We have implemented a version ot the MPLPC' scheme in which iterative 
refinement of the predictor coefficients and the multipuKe excitation is done [109] 
The number of iterations was apnori fixed at 5 in which the parameters mostly 
converged We did not use a long term predictor or the perceptual weighting 
filter Other specifications are dictated by the requirement of large data length per 
speech segment, since we have to approximate the N > -x limit in the correlation 
dimension algorithm Each speech segment analysis contained 80 ms of individual 
phoneme articulations sampled at 8 kHz The LP filter order j; was fixed at 10 
and 80 pulses were used to code the residual from each 80 ms speech segment 
Thus, dynamical analysis was done on N = 040 samples of the reconstruction error 
sequence obtained from individual speech segments Both the correlation dimension 
Dt (eq. (3 36)) and the second order entropy K-i (eq (3 39)) were computed from 
12 phoneme articulahons comprising of the 8 cardinal vowels and one consonant 
each of types nasal, voiced and unvoiced fricative and approximant (Appendix , 
database 1) Figure 4 4 shows a plot of Di[d) vs d where d is the embedding 
dimension. Similarly, fig 4 5 shows a plot of AL(d) vs d tor increasing values of 
d The value of for the reconstruction error sequence of the above data set is 
obtained as 4 25 ±1 33 Similarly, the value of K> is 7731 1 1807s ‘ The error values 
indicate the standard deviation over the 12 samples 
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For the purpose of comparison, the mean values of D 2 and Ko for the speech 
signals of the same segments are obtained as 3 75±0 79 and 7266±2424s"^ respectively 
We also evaluated Dz and Kz for white Gaussian noise sequences of the same data 
length. Ideally, Dz{d) = d m this case Similarly, Kz should be mfinite to indicate true 
randomness. Due to various approximations such as the use of finite data length, the 
computing technique and the method of generating the i i.d. sequence itself (through 
pseudo-random number generators, which are chaotic maps) it is only possible to 
approximate the ideal results for Dz and Kz However, it is important to observe 
that the dynamical invariants are capable of distinguishing a random sequence from 
the reconstruction error sequence of an MPLPC scheme This is evident from the 
comparative plots of figs 4 4 and 4 5 In the case of the 1 1 d sequence, the value of 
Dz{d) does not saturate with increasing d as seen from fig 4 4 Similarly, the value 
of Kz IS significantly higher than that of the reconstruction error sequence (fig 4 5) 

Thus, the reconstruction error sequence is relatively low dimensional and has 
finite predictability It can be argued that dynamical structure is still present in the 
error sequence which is amenable to state space modelling using a few number of 
variables Alternahvely, the speech signal itself may be modelled using a deterministic, 
nonhnear state space model 

In [143], [144] an experiment is descnbed to determine the presence of a nonhnear 
deterministic component in the LP residual A linear prediction analysis is carried 
out on the speech signal and the residual is replaced with Gaussian white noise of 
the same energy This noise is then used to excite the LP filter to produce synthetic 
speech The new signal is characterized by the property that its determmism is limited 
to the linear components — the best possible predictor of this signal is linear By 
comparing the correlation dimension of this synthehc signal with that of the origmal 
speech signal, one can investigate the properties of the LP residual The argument 
goes that if the correlation dimension of the noise excited LP filter is higher than that 
of the original speech signal, then one can conclude that the LP residual contains a 
nonhnear, deterministic component The synthetic speech signal was found to have 
a correlation dimension of 3 9 compared to 2 9 of the original speech signal. From 
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Fig 4.4: Graph for the computation of correlation dimension tor (1) reconstruction error 
sequence obtained from an unvoiced fricative articulation /t/ using a MPLPC 
scheme, and, (2) white Gaussian noise sequence 
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the higher value of the former, it is concluded that a suitable nonlinear predictor 
should be able to perform better than a Imear predictor 

B. Evidence from Polynomial Model of LP Residual 

In an interesting expenment reported recently [139], direct evidence is given of the 
presence of short term quadratic nonlmearity in speech signals after removing all 
Imear dependence with a linear predictor The steps of the experiment consist of 
the following 

(i) Fit an optimal order linear predictor to a frame of speech signal and obtain 
the residual sequence Call this the first stage residual. Compute the signal to 
prediction error raho (SPER) 

(u) Fit an optimal order linear predictor to the stage residual sequence and 

obtain the + stage residual sequence Compute the SPER Repeat (ii) until 

SPER ~0 dB Let the final stage of LP analysis be the stage When the SPER 
of the LP analysis is 0 dB, all linear information withm p samples will have 
been extracted from the original speech signal 
(m) Apply an optimal short term nonlinear predictor based on p samples to the 
stage residual and get the (j stage residual Compare the and (j + 1)‘^ 
stage residuals 

The paper reports this comparison for a polynomial (quadratic) filter without 
the linear part Both voiced and unvoiced speech segments were studied The 
memory, p, of the Imear as well as the nonlinear predictor is set to 10 samples Five 
successive LP analysis stages were sufficient to bring down the SPER to ~ 0 dB It 
IS observed that the nonlinear predictor provides an additional prediction gam over 
the LP stages (No comparative numbers are provided However, the residuals are 
graphically displayed ) Further, pitch periodicity is almost completely removed with 
the short term nonlinear predictor which was still present even after the 5 stages of 
linear prediction 

4.3 Analysis for Polynomial Prediction of Time Series 

The choice of a polynomial representational form is usually one of the first out of 
the infinitely many possibilities This is because it is a simple extension of a linear 
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rnodd srid tln^ optimal set of coefficients/ in the sense of nunimuni ni s e C9n be 
obtained by solving a set ot simultaneous linear equations A polynomial predictor 
attempts to model the higher order correlations in time series It we look at the 
modelling problem as a purely waveform matching exercise, then the question of 
interest is whether a polynomial form gives better prediction gam compared to a 
linear model A deeper issue, which we have not investigated here, is that of the 
perceptual relevance of the new moment information that gets modelled 

In predictive modelling, there are two time frames ot interest One is the 
prediction frame, whose length d gives the lag upto which signal correlations are 
considered for modelling In linear prediction modelling, the model order p is equal 
to the prediction frame length d However, in a polynomial ditterence equation 
model, the total number of coefficients, p is given by p , where I is the 

degree of the polynomial and d is the prediction frame length A disadvantage of 
a polynomial prediction scheme is that the number ot model coefficients increases 
rapidly with d and I The second time frame of interest is the analysis frame, (length 
Nf) from which the optimal set of model coefficients are evaluated Broadly, the 
choice of this length is bounded on the lower side by the requirement of a stable 
solution for the coefficients and on the upper side by the data length available, or, 
as in the case of speech, by the length for which signal stationarity can be assumed 

Let us consider the analysis steps leading to the determination of the optimal 
set of coefficients of the prediction model This development is analogous to the 
covariance method of linear prediction We assume an analysis frame length of data 
Xn, n = 0, ,Nf — l is available, over which the polynomial prediction model has 
to be optimized with respect to minimum m s e The covariance method requires 
knowledge of Xn, n = -d, , -1 of the previous frame also, where d is the maximum 

prediction delay A polynomial difference equation is of the form 

p 

^ (IrnPn + (4 

rn-l 

Pn^V={pl,pi, (4 27a) 


where 
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(4 27b) 


Here, d is the maximum prediction delay, I is the polynomial degree and V is the 
set of candidate polynomial terms The problem is to fit the equation 

p 

^ OmPn (4 28) 

m=l 

to the data x,i, n = 0, 1, , AT/ - 1 in the analysis frame to minimize 

N,-l 




where 


e~n Xn ®rn n, 0, 1, , Nj 1 

This leads to a set of p simultaneous linear equations 

p 

XI ^ ton ) = {^np\) , » = 1, ,P 

m~\ 

where < > denotes a time average over Nf points Let us define 


(4 29a) 


(4 296) 


(4.30) 




ton), = ,p 

(Pn®n) , * = 1) .P 


(4 31a) 
(4 316) 
(4 31c) 


Then, to obtain the optimal prediction model, one needs to solve the following for 
Om, m 1, ,p. 


r0‘>l 

,^2.1 ^2,2 
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ci>' 
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-ax' 




<12 

= 

^2.0 


-Op. 


o 

1 


or 


$a = ^ 


(4 32a) 


(4 326) 


The above equation can be solved with the Cholesky decomposition procedure This 
method allows us to compute the reductions in the average prediction error over the 
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analysis frame with the addition of successive terms in the model [114] It must be 
noted that we have not yet specified the order m which the terms are arranged in 
the model of eq. (4 26) This will determine the increase in average prediction gam 
as successive model terms are included from 1 to p We have principally considered 
two ordermg schemes for the model terms 

( 1 ) In the first method, we exhaust all possible terms upto a certain time lag before 
considenng terms which include signal dependence tor greater lags While this does 
not completely specify the ordering of terms tor a general degree L polynomial, the 
ordermg for a quadratic polynomial, which we have studied tor speech, is specifically 

7 > z 

given by 0C„_i,x;_i,a:n-2iain-l3Cn-2i-^n 2 '^n l^n 1 , J'n Z^n h I'ri i' 

(2) The second method, which is potentially more interesting, is based on orthogonal 
term search [82] Suppose that we are interested in Lonsidering nonlinear terms 
upto degree I and lag d The total number ot possible candidate terms is given 
by C = of which only p terms are to be chosen such that the addition ot 
each successive term in the polynomial model leads to the maximum reduction m 
prediction error over the analysis frame at that instant lor this. Gram Schmidt 
orthogonalization is performed on the polynomial basis set (see eq (4 27)) This is 
combined with Cholesky decomposition tor tast candidate search at each stage of 
term inclusion From eq (4 26), we have 

p 

Xji — ^ ^ ffrn "h (4 33) 

rn 1 

where w’^ is related to by the Gram Schmidt orthogonalization procedure 

pI = 

m-l 

Pn = m -r- 2, , J1 (4 34) 

r -1 

Thus, 

10 0 1 
V2i 1 0 0 r“;{ 

«31 v-52 1 0 '■ (4 35 ^i) 

Vpl Ij ” 
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or 

P = Vw (4 356) 

The m s e in terms of the onginal coefficients a^, m = 1, ,p is given by 

E = {el) 

P 

= (®n> - am (XnPn) 
m=l 

= (4 36) 


where is given by eq (4 31c) and a and ^ are as given m eq (4 32) 

In terms of the orthogonal basis set w, eq. (4 35), the m s e is equivalently given 

W>-E»™(«)') (4 37) 

m=l 

where g,n, m = I, ,p are the coefficients given by eq (433) Suppose that 
OrPD, 1 < r < p IS the last polynomial term added to the model m eq (4 26) The 
addition of this term reduces the m s.e. by an amount 

AEr = gl («)") (4 38) 


where 



(4 39) 


The polynomial term pj^ should be chosen from the remaming candidate terms at 
the stage such that A£'r is the maximum A fast orthogonal search method to do 
this using Cholesky decomposition is given in [82] This process is repeated until all 
the p terms are successively found 


We briefly look into the computational complexity of Imear prediction and 
polynomial prediction schemes There are two major sources contributing to the 
computahonal complexity in either scheme These are the multiplications involved 
in the computation of the covariance matrix $ and the solution of the p simultaneous 
equations to obtain the optimal set of coefficients For the linear prediction scheme, 
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O(Nfp) multiplications are required to obtain the <I> matrix and 0(_p'^} multiplications 
are required to solve the simultaneous equations by Cholesky decomposition method 
which IS used for the covariance method For the case ot quadratic difference 
equations, 0{Nfd^) multiplications are required to find the entries ot the ^ matrix and 
O(p^) multiplications are required to solve the simultaneous equations by Cholesky 
decomposition for the first method of ordering ot terms given above For the second 
method, a fast orthogonal search algorithm requires 0(pCNf f p*C) multiplications 
to obtain the optimal set of coefficients after the computation ot the <F matrix [82] In 
the above, Nf denotes the length of the analysis trame, p is the number of coefficients 
in the model, d is the maximum prediction lag considered and C is the total number 
of candidate terms from which the p polynomial terms are chosen 

We have implemented the above two term selection procedures to study the 
performance of polynomial prediction ot speech 'I he results are elaborated in sechon 
44 

4.3.1 State Space Formulation of Polynomial 
Predictive Modelling of Time Series 

The problem of polynomial prediction of scalar time series can be cast into a one-step 
prediction problem in d-dimensional state space, where d is the maximum time delay 
upto which signal correlations are considered in the polynomial model While such 
a reformulahon may not offer additional insights in the present prediction problem, 
it will be immensely useful in the understanding of the local state prediction problem 
to be studied in chapter 5 We give this formulation here for the sake of completeness 
of discussion in terms of state space predictive modelling 

Let us assume, as before, that time series data x„, n = -d, ,N/ - 1 is known 

Here, d and Nf are the predichon and analysis trame lengths respectively Reconstruct 
a d-dimensional vector time series xf,, n = -1, ,Nf - 1, where 

JC„ iX„] (4 40) 

Then, a state space model of is given by 

4 = g(x(l ,) -kef, (441) 
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x„ = hJ'xi n = 0, 1, ,Nf-l 

(4 42) 

where 


ei = [0 0e„]^ 

(4 43) 


= [0 01]^ 

(4 44) 


5(4- 1) = 4 

d ^ 

= ^n-d-4-2 l)] 

(4 45) 


Let us denote 

x„ = /(4_|) 

= E<^!>r "=o,i 

m=l 

Then, e„ in eq (4 43) is given by 

e„=x„-x„, n = 0,l Nf- 1 (4 47) 


IS the nominal trajectory shadowing the original reconstructed trajectory 
It differs from only in the last coordinate To find the model coefficients, 
rn= 1, ,p, eq (4 46), use any one of the following two equivalent mse 

minimization criteria 

(1) In terms of the norm of the vector difference, minimize 


Nf-l 


E 


— tl 


(4.48) 


where xi and H are given by eqs (4 40) and (4 45) respectively, and || « refers 
to the Li norm 

(2) In terms of the scalar difference between the original and reconstructed hme 


1 




(4 49) 


series, minimize 
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where Xn is the original time series and x„ is given by eq (4 46) 

In the next section, we discuss the results of polynomial prediction of speech 
and make some observations 

4.4 Results of Polynomial Prediction of Speech 

We have implemented the polynomial prediction scheme and studied it tor speech 
for the quadratic case, i e , degree I = 2 Both the methods ot ordering of model 
terms as discussed in section 4 3 are considered Further, we have compared the 
prediction gains with that ot a standard linear prediction scheme tor the same number 
of coefficients in both models 

The phoneme specific sentence utterances ot speech database 2 (Appendix B) 
were used as the basis of comparison Each ot the 4 sentences spoken by three 
males and three females were used tor this study The total duration ot speech signal 
corresponding to each sentence spoken by the 6 speakers is as follows sentence 
(a) Why were you away a year, Roy’^ - 14 25s, sentence (b) Nanny may know my 
meaning - 11 59s, sentence (c) His vicious father has seizures - 15 74s, and sentence 
(d) Which tea party did Baker go to? - 15 42s Thus, a total ot 57 0s ot speech 
was used Figures 4 6(a)-(d) show the segmental prediction gam (based on analysis 
frame length Nf) expressed as segmental SNR in dB tor the two methods of term 
arrangement m the quadratic model for each of the four sentences respectively The 
figures also show the corresponding segmental prediction gain tor a linear prediction 
scheme (covariance method) Figure 4 6(e) shows the segmental prediction gains 
corresponding to the three cases averaged over all the four sentences The analysis 
frame length, Nf was fixed at 160 samples corresponding to 20 ms ot speech sampled 
at 8kHz For the LP filter, the number of coefficients was varied from 1 to 20 For 
the first method of term arrangement in the quadratic model, the maximum delay 
d was fixed as 5 This allows a maximum number of 20 coefficients in the model 
without a constant term The number of model terms were increased from 1 to 
20 according to the term ordering strategy outlined in section 4 3 For the second 
method of term arrangement corresponding to orthogonal selection, the candidate 
set included all linear terms upto a delay of 10 samples and all quadratic terms upto 
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No. of parameters 


(e) 


4 6- The seemental pred.chon gam expressed as SNR m dB as a function of the 
mode, coefhcenl for three types of predictors « - 

shown by dashed line, (ti) Quadratic predictor using first method of ' 

:elechonf denoted by (1), and (iii) Quadrahc 

based on orthogonal term selechon, denoted by (2) . 

respective graphs for the 4 sentences (aH^) respectively (see text) SP°P“ 
Tmdes and 3 females Part (e) shows the 3 graphs averaged over all the 4 

sentences spoken by the 6 speakers 
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a delay of 6 samples Thus, from a total ot 10 + 21 =■ .il candidate terms, 20 

terms were successively selected tor each analysis frame according to the orthogonal 
selechon procedure of method 2 and the corresponding increase in prediction gam 
was recorded with the addition ot each successive term 

In context of the above, the following observations are noteworthy 

(1) The quadratic prediction scheme, based on the first method ot term arrangement, 
does not perform as well as the LP scheme in terms ot segmental prediction gam 
(SNR) This can be seen from the graphs ot figs 4 6(a)-(e) For 10 coefficients per 
model, the quadratic predictor gives 3 09 dB /t’ss segmental prediction gam compared 
to the Imear predictor. At 20 coefficients per model, this ditterence is 2 08 dB 

(2) It must be noted that while the linear predictor availed ot the correlations upto a 
time delay of p samples, the quadratic predictor is based on signal dependencies upto 
a lag of d samples where d < p For example, tor p 9, d 3 and tor p = 20, d = 5 
In view of the poor prediction performance ot the quadratic predictor based on the 
first method of term arrangement, it may be worthwhile to first remove the linear 
dependence in the signal and then apply the nonlinear predictor to the residual 
signal Based on this intuitive idea, we studied the performance ot the following 
two stage predictors In the first stage, an optimal order LP filter is applied to the 
speech signal per analysis frame to give the first stage residual In the second stage, 
a quadratic predictor (1 to 20 coefficients based on a maximum prediction delay, d, of 
5 samples) was optimally determined from the first stage residual for each analysis 
frame, to get further prediction gam We compared the prediction performance with 
a LP filter (number of coefficients varying from 1 to 20) at the second stage The 
additional segmental prediction gam at the second stage obtained over the entire 
speech database with the quadratic predictor is 2 07 dB for 10 coefficients and 2 83 
dB for 20 coefficients compared to 0 54 dB and 1 17 dB tor 10 and 20 coefficients 
respectively with the LP 

(3) The second method of terms arrangement of the quadratic predictor gives a 
modest improvement m the segmental prediction gam compared to the IP case tor 
the same number of coefficients This can be seen from the comparative graphs ot the 
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overall segmental SNR in fig 4 6(e) and for sentences (a), (b) and (c) in figs 4 6(a)-(c) 
respectively However, for sentence (d), the LP gives slightly better segmental SNR 
tor number of model terms exceeding 8 For 10 coefficients per model, the quadratic 
predictor gives 0 11 dB improvement m overall segmental predichon gain compared 
to LP At 20 coefi5cients per model, the improvement is 0 33 dB (fig 4 6(e)). 

(4) For the orthogonal term selechon procedure of the quadratic model, we recorded 
the frequency at which each term was selected at a particular position, for all positions 
in the model It was found that the terms x„_i, Xn- 2 , Xn -3 and x „_4 were selected 
more often at the 2"^, places respectively compared to other positions, 

but with decreasing frequency in that order The frequency of positions of selection 
of other terms was comparahvely more spread out 

In a similar study of quadratic predictors reported recently [139], the basis of 
comparison with LP is the time delay d upto which signal correlations are considered 
m the model rather than the number of coefficients p as above. For a time delay, 
d = 10 samples, the number of coefficients in the linear predictor is 10 while that 
in the quadratic predictor (mcluding 10 linear terms) is 65 For this case, usmg an 
analysis frame length Nf - 200 samples (25 ms at 8 kHz sampling rate) gives a 
segmental prediction gain of 18 2 dB with the quadrahc predictor compared to 14 1 
dB of the Imear predictor This corresponds to a significant improvement of 4 1 dB 
with a quadratic predictor model Another important observation is that the short 
term quadratic predictor is capable of modelling the pitch period redundancy to a 
great extent which is not possible with a order LP 

The above observations suggest that while a quadratic filter may not offer 
significant advantage over a Imear predictor in the usual forward adaptive mode in 
which it IS used in most medium to low bit rate coders, it may give some advantages 
over LP when used m a backward adaptive mode (for example as in low delay, 
medium bit rate coders) which does not require the transmission of the model 
coefficients The possible advantages may be in terms of prediction gam and the 
better ability to model the pitch period redundancies with a short term predictor only 
These speculahons must be seen m perspective of the 16 kb/s LD-CELP for example. 
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which uses a relahvely large 50'‘ order l.near pred.Cor (m backward adaphve mode) 
m order to model the lorrg term correlahorts m the signal as well 



Chapter 5 

Local State Prediction Coding 
of Speech 


By casting a scalar time senes prediction problem in state space by reconstructing a 
vector time series, we can use any one of two basic prediction schemes, namely, the 
global and local prediction schemes The reconstruction of a vector time senes from a 
scalar observable using the method of time delays has sanction in dynamical systems 
theory through Takens' theorems which we discussed m Chapter 2 There exists a 
deep connection in the form of differentiable equivalence between the reconstructed 
vector time series and the original dynamical system trajectory whose time evolution 
IS monitored through the scalar observable (whose prediction we are interested in 
over here) In chapter 4, we investigated a global prediction scheme for speech using 
polynomial representation form It is easy to see that the usual linear prediction 
schemes (forward adaptive, backward block adaptive and recursive prediction) can 
be formulated as problems of one step global prediction of linear dynamical systems 
in appropriately reconstructed state space (sections 1 4 and 4 31) In contrast to 
a global prediction scheme where the system parameters are optimized over the 
reconstructed vectors over the entire state space, a local prediction scheme optimizes 
the parameters over a local state space volume where the prediction is to be done 
Thus, a local state predichon scheme reduces the dependence on the representation 
form compared to global state prediction 
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A Local State Prediction (LSP) scheme can be implemented as tollows 

• Embed the scalar time series in a trajectory in state space using the method of 
time delays (see eg chapter 2) Call this the "reconstructed trajectory" and the 
dimension of the state space as the "embedding dimension", d 

• For a one step prediction of a target vector on the reconstructed trajectory, 

★ Find a local neighbourhood of Nf, trajectory points of the target vector from 
(possibly fixed number Nf of) previous points on the trajectory, 

★ Fit a local model between the trajectory points in the neighbourhood and 
their respective future points, 

★ Use this prediction model on the target vector to obtain its prediction 

• Project the predicted vector on an appropriate coordinate axis depending on the 
method of reconstruction, to obtain the predicted scalar value 

We will study the prediction properties of LSP tor speech in terms ot the segmen- 
tal prediction gam, plots of the residual sequence, their spectrum and autocorrelahon 
function and compare them with those of (i) short term Linear Prediction (LP) resid- 
ual, and, (ii) short term plus long term LP residual In a local state prediction scheme 
for speech, an appropriately chosen neighbourhood will contain trajectory points that 
are close to the target vector m time as well as those which are approximately an 
integral number of pitch or formant periods away. Thus, a LSP attempts to simulate 
the function of both short term and long term linear predictions simultaneously 

The motivation for studying the prediction properties of a LSP scheme is to 
explore whether it can be advantageously used in a speech coding scheme A 
natural method of incorporating LSP in a speech coder is to use it analogous to a 
backward adaptive scheme This is because the usual forward adaptive method of 
obtaming the optimal parameters within a frame of speech and transmitting them 
may prove to be extremely expensive in terms of bit rate for a local prediction scheme 
where optimal model fits have to be done in local neighbourhoods within the frame 
Thus, a speech coding scheme based on LSP can provide low coding delay in which 
context backward adaptive and recursive LP schemes are usually studied and used 
We propose a framework for low to medium delay speech coding in the medium bit 
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rate range based on LSR The coder uses an analysis — by — synthesis scheme and is 
structurally similar to CELP It is tentatively named as a Vector Excited Local State 
Predichon (VELSP) coder 

The organization of the chapter is as follows In section 5 1, we discuss the 
steps involved in local state prediction analysis Sechon 5 2 is concerned with the 
performance of iterahve prediction of speech based on an autonomous LSP system 
This prediction performance is interpreted m terms of the metric entropy results of 
chapter 3 We discuss in some detail, the predichon properhes of a one step local 
state predictor for speech in section 5 3 Sechon 5 4 recapitulates some of the recent 
studies in local methods for speech prediction and coding Finally, in sechon 5 5, 
we propose the VELSP coding scheme and discuss the performance of a skeletal 
shucture of such a coder 

5.1 Local State Prediction (LSP) Analysis 

In this sechon, we will elaborate on the steps given above for the implementahon 
of LSP The basic idea is illustrated in fig 51 Let us consider that a scalar hme 
senes n = -Nj - d + I, ,-lis available based on which we have to predict 
the hme series for Np steps from n = 0 to n = Np - I using LSP Here, Nf is the 
analysts frame length and d is the embedding dimension of the state space in which 
local state prediction is to be done The choice of embedding dimension d comes 
from a knowledge of the dimension of the hme senes and the attendant necessary 
and sufficient condihons for state space reconstruction (chapter 3) or from extensive 
experimentation with several realizations of the underlying process of the time series 
observable 

The first step is to reconstruct a vector time series x^, n = -Nf, 1, where 

~ [®n-d4-l ®n— d+2 ®n.— 1 ®n] (5 1) 

The local state predictor is given by 

= g(Xn-l) 

= h^xf^, n = 0, 1, ,Np-l 


(5 2a) 
(5 26) 
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Fig. 5.1: Schematic showing local state prediction A local neighbourhood ot the current 
trajectory point is found from previous trajectory points to predict the 
future point 


Also, 


xli = [i_<j xi d 
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Here, g(x^_i) is the local state predictor of x^_i and /(x)( 0 is a local representation 
form 

In order to evaluate /(x^_i), we need to define a AT/, point neighbourhood of xi [ 
A simple way to assign neighbourhoods is to partition the state space into disjoint 
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sets using, for example, a rectangular grid. While such an approach is convenient 
and efficient in the case of successive predictions using the same analysis frame, it 
has the disadvantage that there is no overlap between neighbourhoods. A point near 
the boundary of its neighbourhood may be poorly approximated One way to cope 
with this problem is to perform mterpolahon between optimal functions of adjacent 
neighbourhoods However, this becomes a difficult task for state space dimension 
greater than two An alternative that is more accurate than disjomt partitions and 
more convenient than enforcmg matching conditions at the boundaries is to overlap 
the boundaries in such a way that each prediction is done from a good set of 
neighbours 

For predicting a point x^, we choose a Ni point nearest neighbourhood in the 
sense of minimum Euclidean distance from the available reconstructed trajectory 
points, x^, n = —Nf, , —2 Let these Nl pomts and their future points be denoted 
pairwise as (xf^,xf,^i), (xf^,xf^^i), 

The next step is to obtain the parameters of a local representation form which 
maps the N[, pomts to their respective future points in some optimal sense Various 
representation forms can be chosen for this purpose However, the dependence of 
the prediction performance on representation can be expected to be less for LSP 
compared to global prediction A trivial choice of a local representation is a constant 
map The next higher level of approximation, which we have studied for speech, is 
provided by a local Imear predictor Such a predictor for a state space point x*^ is 
given by 

/ (x*^) = ao -f- [ai aa] x*^ (5 4) 

where oq, ,aa are the d + I predictor coefficients If the number of nearest 
neighbours = d + 1 , then one can simply do linear mterpolahon to find the 
coefficients However, one is generally interested in the case Nc > d, which provides 
an overdetermmed set of linear equations in the predictor coefficients 


(5 5) 


aJtj + i = ao -F [ai aa] Xt,, jf = 1, 
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where is the d!-^ scalar component of The {d -t- 1) coefficients are then 

solved from a set of simultaneous linear equations which are obtained by doing a 
weighted minimization of E with respect to ao,ai, ,ad, where 


and 


> 



The local linear predictor (LLP) can be seen as a generalization ot the usual linear 
predictor (LP) A LP based on d delays can be considered as a rZ-dimensional plane 
passing through the origin in {d + 1) dimensional space, where the d axes correspond 
to the d delay coordinates and the {d f axis gives the prediction coordinate A 
corresponding LLP can be considered as a surface 5 in a (d f 1) dimensional plane 
The scalar prediction of the point will be given by the (d + 1)^^ coordinate 
of the corresponding (d + 1) dimensional point on S The LLP /(xf, eq (5 4), will 
be given by a plane tangent to the surface at that point The additional constant, 
oo, m the predictor accounts for any local mean of the reconstructed trajectory, x^ 
Also note that if the local neighbourhood extends to include the entire state space, 
then a LLP reduces to the usual linear predictor 

In the following two sections, we study the properties of many step, i e , iterative, 
and one step prediction of speech using local linear state prediction 


5.2 Performance of an Autonomous 
Local State Prediction System for Speech 

We will first look into the prediction performance of an autonomous local state 
prediction system for speech, le, one which does not get external excitation after 
starting the iteration from an initial condition To begin with, let us consider a speech 
frame, x„, n = —Nf — d -y 1, 1, is available The first step is to reconstruct a 

vector trajectory x^, n = -Nj, 1 in d- dimensional state space using eq (51) 
We will refer to this reconstructed trajectory of length Nf as the analysis frame 
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The next step is to predict Np samples n = 0, ,Np~l using eqs (5 2)-(5 6) 
If the corresponding speech samples are given by a:„, n = 0, ,Np-l, then 

^n,i ~ ®n,t ^n,n n. = 0, 1, , 1 (5 7) 

denotes the prediction error at iteration step n + 1 The additional subscript i is to 
denote the analysis frame number Successive Np step predictions are performed 
by shifting the analysis frame by Np pomts on the speech hme series The above 
prediction scheme is shown schematically m fig 5 2. It is expected that the average 
prediction error is a function of the iteration step — as one predicts further in time, 
the prediction based on the analysis frame and the previous reproduced pomts is 
expected to get worse We quantify this, for the case of speech, by plotting the 
prediction gain, PG(n), in dB, as a function of the iteration step n, m fig 5 3 Here, 


analysis frame 


fff TRATEcrORV 
IN d-DIMENSIONAL SPACE 



Np STEP ITERATIVE 
PREDICTION 


?Nn-l 


NEXT Np STEP 
PREDICTION FRAME 


Fig. 5.2: An Np step 


iterative prediction using a local state predictor 
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2 

PG(n) = 10 logiofe^, n = 0,l, ,Np-l ( 5 g) 

We have used the 4 sentence utterances of speech database 2 (Appendix B) by 3 male 
and 3 female speakers to obtain the plot of fig 5 3 The other parameter choices are 
analysis frame length Nf = 160, number of nearest neighbours Nl = 40, embedding 
dimension d = 10 and number of iteration steps Np = 10 An autonomous local state 
prediction system is unsuitable for use as such in a speech coder because of the 
rapid decrease m the prediction gain with increasing iteration steps Even if we 
allow an excitation function for the prediction system, in order to get a prediction 
gain independent of the iteration step number, one has to incorporate an asymmetric 
dependence of the excitation function on the iteration step which is undesirable 

The graph of fig 5 3, however, serves to verify the results of second order 
dynamical entropy, K 2 , for speech reported in chapter 3 Recall that for an autonomous 
dynamical system, the metric entropy K quantifies the rate of loss of information 
about Its initial condition Also, K > K 2 If i? is the resolution ot the initial condition 
m bits, then the dynamical system can be iterated for Nt ^ ^ steps on the average, 
before all information about the initial condition is lost and no further iteration is 
possible For K 2 — 0 5A bits/sample (8613 s“^ at 16 kHz - see Table 3 2), and i? = 16 
bits, At < 29 7 iterations It is seen from fig 5 3 that the average prediction gain 
with an autonomous LSP system goes below 0 dB after 9 iterations on the average 
The difference in the two results can be largely attributed to the fact that while the 
former is a theoretical estimate, the latter is a model specific result Nonetheless, 
the comparative values reinforce our faith m the second order entropy results for 
speech time series 

5.3 One Step Local State Prediction of Speech 

In this section, we study in some detail the one step local state predictor for speech 
For the one step case, Np = 1 m the LSP analysis formulation of section 5 1 (eq 
(5 2)) Based on length Nf + d—1 of scalar data, a d-dimensional vector time series of 
length Nf is reconstructed Next, a one step prediction into the future is done using 
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Fig. 5 3. Prediction gain vs iteration step using an autonomous local state predictor 
system 
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LSP The subsequent analysis frame is obtained by shifting the previous trame by 
one sample and another one step prediction is performed This process is repeated 
for the entire time series 

Given a scalar time series, Xn, n = —Nf - d + 1, , —1,0, 1, , iV - 1, the first 

4 - (£ _ 1 samples are used to form the first analysis frame based on which ig 
IS predicted Let the predicted sequence be n = 0, 1, ,N - 1 The predicted 
point 5,, i = 0, 1, ,N - 1 uses an analysis frame of scalar points x„, n = i - (N/ + 
d—1), ,1—1 There are various factors in a local state prediction scheme which 

can affect its prediction performance These are the vector analysis frame length Nj, 
the embedding dimension d and the number of nearest neighbours N[,, apart from 
the representation form for the local fit For a local linear representation, we have 
found the segmental prediction gains (henceforth referred to as segmental SNR) for 
each combination of Nf, d and Ni from Nf = 160, 240, .320 d = 1,4,7,10 and Nf = 
10, 20, 30, 40, 50 The results are summarized in figs 5 4(a)-(e) and fig 5 5 The 
segmental SNR reported in this chapter (and elsewhere) are based on lengths of 160 
samples unless explicitly stated otherwise The speech database used for this study 
consists of the 4 phoneme - specific sentences 

(a) Why were you away a year, Roy'? 

(b) Nanny may know my meaning 

(c) His vicious father has seizures 

(d) Which tea party did Baker go to? 

Each sentence was spoken by 3 males and 3 females (speech database 2, Appendix 
B) Figures 5 4(a)-(d) show for the 4 sentences (a)-(d) respectively, the segmental 
SNR in dB as a function of d for Ni = 10,20,30,40 and 50 each and Nf = 160 Figure 
5 4(e) shows a plot of the above averaged over all the 4 sentences The variation 
of the segmental SNR with the analysis frame length Nf is shown for d = 10 and 
Nl = 20, 30, 40 and 50 m fig 5 5 

The following observations regarding the segmental SNR can be made 
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5.4: 




Graphs showing segmental prediction gam m dB vs the state space or embed- 
Jing dimension d for a local state predictor Parts (aHd ^ 

sentence utterances (see text) and part (e) gives the overall 8 P 

shows 5 plots corresponding to local neighbourhood size, Nl = W, 2U, 
and 50 The analysis frame length Nf = 160 samples 
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Fig. 5.5: Segmental prediction gain in dB as a function of the analysis frame length N/ 
for a local state predictor The 4 plots correspond to local neighbourhood size 
Ni— 20, 30, 40 and 50 In each case, the embedding dimension, d = 10 
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(1) Comparing the graphs of fig 5 4 (a)-(d), it is seen that the segmental SNR is 
maximum for sentence (a) which comprises of vowels and glides only The 
segmental SNR is minimum for the utterance of sentence (c) which contams 
only voiced and unvoiced fricatives besides vowels. 

(2) One can compare the segmental SNR obtained for the 4 sentence utterances and 
for the overall database using LSP in figs. 54 (a)-(e) with that obtained using 
short term LP (covariance method) m figs 4.6 (a)-(e) It must be noted that 
the LP scheme optimizes the predictor parameters within an analysis frame (of 
160 samples in this case) and the prediction is done within the frame whereas a 
LSP scheme predicts a point based on analysis frame consisting of past samples 
While the LP is a forward block adaptive scheme, the one step LSP is reminiscent 
of recursive adaptation Comparing the segmental SNRs for the entire database 
(figs 5 4(e) and 4.6(e)), we see that LSP performs margmally better than LP for 
d = 4, 7 and 10 and Nl = 40 and 50 in LSP In both cases, Nf = 160 samples For 
d = 10 and = 40, for example, the segmental SNR using LSP is 16.4 dB The 
comparative figure for LP is 16 08 dB 

(3) The segmental SNR using LSP with a local neighbourhood of Ni = 10 samples 
IS quite poor at all reconstruction dimensions. The prediction performance for 
Ni = 30, 40 and 50 is relatively similar This can be seen from figs 5 4(a)-(e) 

(4) The relative increase in the segmental SNR becomes less with increasing state 
space dimension This increase is only 0 68 dB as d is increased from 7 to 10 for 
Ni =40 (fig. 5 4(e)) 

(5) It is seen from fig 5 5 that the segmental SNR decreases slightly as the analysis 
frame length is increased from 160 to 320 samples This decrease may be 
attributed to the nonstationary nature of the speech signal This suggests that 
predictive schemes based on local methods must use an adaptive analysis frame of 
Ni ~ 160 samples rather than a fixed frame consisting of very large number of speech 
samples 

In order to get a better understanding of the prediction properties of a local 
state predictor, we have compared the LSP residual of segments of running speech 
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with the short term LP residual and the short term plus long term LP residual m 
figs 5 6 - 5 12 This companson is done in terms of the time series plots, their 
autocorrelation functions and the spectrum in parts (a), (b) and (c) respectively of 
each figure Based on such studies, the following general observations about the 
prediction scheme can be made: 

(1) The performance of a LSP can be broadly said to he between a short term LP 
and a short term plus long term LP both in terms of segmental prediction gam 
and the abihty of a LSP to remove correlations in the speech signal This can be 
expected because local state prediction is based on a local neighbourhood of the 
target vector which contains reconstructed trajectory vectors that are close to it m 
hme as well as those which are close to it in space. The vectors which are close 
to the target vector in space but not in time are those which he approximately 
an integral number of pitch or formant frequencies away from it This gives a 
LSP Its abihty to model long term signal correlations It is also worth noting 
that compared to a long term linear prediction filter which explicitly uses a pitch 
predictor, no hard decision regarding the pitch period has to be made in a local 
state predictor 

(2) The spectrum of a LSP residual appears ^^whiter” than that of a short term LP 
residual The LSP is comparatively better at removing spectral peaks in the 
lower frequency range (< 1000 Hz) than at the higher frequencies 

(3) It has been observed that if the spectrum of the LSP residual contains some 
prominent peak, then this spectral content is generally not removed by long 
term hnear prediction as well 

(4) Like other backward adaptive schemes, the major disadvantage of a LSP scheme 
lies in its uiability to track sudden changes in signal characteristics compared 
to forward adaptive schemes Such an example is shown in fig 5 13 where a 
comparison of the LSP residual with the forward adaptive short term LP and 
short plus long term LP residuals is also made In another example. Table 5 1 
shows comparative SNR values of successive frames based on 160 samples each 
The first frame denotes almost silent region There is a sudden increase in the 
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0 80 160 


Discrete time 
(a) 

Time series plots - order LP used to remove short term correlations from 
the speech signal The long term LP uses 3 coefficients The LSP parameters 
are Nf = 320, Nl = 40 and d = 10 The SNR based on 160 samples for the 
short term LP residual, short term plus long term LP residual and LSP residual 
are 14 58 dB, 15 23 dB and 19 16 dB respectively 
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Autocorrelation function plots corresponding to the 4 time series plots of part 
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(c) 


Spectrum plots corresponding to the 4 hme senes plots of part (a) computed 
using 160 point DFT Note the relative magnitudes of the plots 


Fig- 5 6. Comparative plots of the speech signal, short term LP residual, short term plus 
long term LP residual and local state prediction residual in terms of (a) time 
series, (b) autocorrelation function, and, (c) DFT spectrum 
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Time series plots - 10^'" order LP used to remove short term correlations from 
the speech signal The long term LP uses 3 coefficients The LSP parameters 
are N/ = 320, Ni = 40 and d = 10 The SNR based on 160 samples for the 
short term LP residual, short term plus long term LP residual and LSP residual 


are 15 69 dB, 21 53 dB and 17 31 dB respectively 







Autocorrelahon function plots corresponding to the 4 time series plots of part 
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(c) 


ipectrum plots corresponding to the 4 time series plots of pt 
using 60 point DFT Note the relative magnitudes of the plots 
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Speech signal 





(q) 


Time series plots - order LP used to remove short term correlations from 
the speech signal The long term LP uses 3 coefficients The LSP parameters 
are Nf = 320, = 40 and d = 10 The SNR based on 160 samples for the 

short term LP residual, short term plus long term LP residual and LSP residual 
are 26 55 dB, 31 48 dB and 31 10 dB respectively 
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(c) 


Spectrum plots corresponding to the 4 time series plots of part (a) computed 
using 160 point DFT Note the relative magnitudes of the plots 


Fig. 5.8: Comparative plots of the speech signal, short term LP residual, short term plus 
long term LP residual and local state prediction residual in terms ot (a) time 
series, (b) autocorrelation function, and, (c) DFT spectrum 
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0 80 160 240 320 

Discrete time 

(a) 

Time series plots - 10^^ order LP used to remove short term correlations from 
the speech signal The long term LP uses 3 coefficients The LSP parameters 
are Nf = 320, Nl = 40 and d = 10 The SNR based on 160 samples for the 
short term LP residual, short term plus long term LP residual and LSP residual 
are 21 88 dB, 26 61 dB and 21 41 dB respectively 
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Autocorrelahon function plots corresponding to the 4 time series plots of part 
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Spectrum plots corresponding to the 4 time senes plots of part (a) computed 
using 160 point DFT Note the relative magnitudes of the plots 


Fig. 5 9: Comparative plots of the speech signal, short term LP residual, short term plus 
long term LP residual and local state prediction residual in terms ot (a) time 
series, (b) autocorrelation function, and, (c) DFT spectrum 
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0 80 160 240 320 

Discrete time 
(a) 


Time senes plots - order LP used to remove short term correlations trom 
the speech signal The long term LP uses 3 coefficients The LSP parameters 
are iV} = 320, Nl = 40 and d = 10 The SNR based on 160 samples tor the 
short term LP residual, short term plus long term LP residual and LSP residual 
are 12 06 dB, 16 51 dB and 13 04 dB respectively 
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(c) 


Spectrum plots corresponding to the 4 time series plots of part (a) computed 
using 160 point DFT Note the relative magnitudes of the plots 


Fig 5 10: Comparative plots of the speech signal, short term LP residual, short term plus 
long term LP residual and local state prediction residual in terms of (a) time 
series, (b) autocorrelation function, and, (c) DFT spectrum 
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Discrete time 
(q) 

Time series plots - order LP used to remove short term correlations trom 
the speech signal The long term LP uses 3 coefficients The LSP parameters 
are N/ = 320, Ni = 40 and d = 10 The SNR based on 160 samples tor the 
short term LP residual, short term plus long term LP residual and LSP residual 
are 27 02 dB, 30 14 dB and 27 26 dB respectively 
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Speech signal 




(q) 

Time senes plots — 10*^ order LP used to remove short term correlstions from 
the speech signal The long term LP uses 3 coefficients The LSP parameters 
are Nf = 320, Nc = 40 and d = 10 The SNR based on 160 samples for the 
short term LP residual, short term plus long term LP residual and LSP residual 
are 8 68 dB, 9 77 dB and 6 51 dB respectively 




Autocorrelahon function plots corresponding to the 4 time series plots of part 

[a) 
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Fig 5.12 



(c) 


Spectrum plots corresponding to the 4 time series plots of part (a) computed 
using 160 point DFT Note the relative magnitudes of the plots 


Comparative plots of the speech signal, short term LP residual, short term plus 
long term LP residual and local state prediction residual in terms of (a) time 
series, (b) autocorrelation function, and, (c) DFT spectrum 
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0 80 160 240 320 

Discrete time 


Fig. 5.13: Time series plots to show the inadequacy of the LSP scheme to track sudden 
changes in the data compared to the forward block adaptive case of the LP 
filters The SNRs for the two 160 sample frames are (i) 25 22 and 10 89 dB, (ii) 
26 18 and 10 53 dB and (iii) 22 67 and 0 84 dB for (i) short term LP residual, (ii) 
short term plus long term LP residual and (iii) local state prediction residual 
respectively 
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signal amplitude in the form of a voiced utterance m the second frame which 
continues in subsequent frames The LSP fails to predict the transient in the 
second frame and the SNR picks up only in subsequent frames 


Frame Number 

hb 

2 

3 

4 

5 

Short term LP residual 


10 39 

1144 

22 00 

23 52 

Short plus long term LP residual 


11 03 

12 38 

23 08 

26 39 

LSP residual 

22 67 

084 

6 69 

18 29 

24 63 


Table 6.1: SNR values for successive 160 sample frames to illustrate the inadequacy 
ot the LSP scheme to track sudden changes in signal characteristics. 


Computafaonal Complexity Considerations: A local state prediction is significantly 
more expensive in terms of computational complexity compared to the usual linear 
prediction To predict one sample using LSP the major sources of computational 
complexity are in the following three operations 

(1) To find Ni nearest neighbours from N/ trajectory points one has to compute 
the (J\[f - 1) distances from the target vector in d-dimensional space 

(2) To fit an optimal local linear predictor between the Ni nearest neighbours and 
their future points, one has to compute a (d + 1) x (d + 1) covariance matrix 
that requires 0{Ni^) multiplications Note that a simplification to 0{Nid) 
multiplications as in the case of linear prediction is not possible 

(3) To solve for the (d + 1) coefficients of the predictor, an additional O(d^) multipli- 
cations are required with the Cholesky decomposition method 

One can compare the above with the computational complexity of a linear 
prediction scheme which requires 0{Nfd) multiplications to compute the entnes of 
the covariance matrix and O(d^) multiplications to get the d optimal LP coefficients 
These computations are required to predict all the Nf samples within an analysis 
frame in the forward adaptive case 
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5.4 Recent Studies in Local Methods for 
Speech Prediction and Coding 

Some researchers have reported the study ot local methods tor speech prediction 
and codmg in the recent past We will discuss the salient features ot these schemes 
in terms of the terminology used in the previous sections 

One ot the first methods in this category is the Pattern Search Predictor (PSP) 
of Bogner and Li [18] This predictor is similar to the LSP except tor an addihonal 
signal matching scheme to scale all reconstructed vectors by appropriate norms to 
account for variations in signal amplitude Only a single neighbour is chosen (le, 
Ni = 1) from an analysis frame of length N/, and the prediction ot the target vector 
is taken to be the future sample of the nearest neighbour They have incorporated 
a PSP in a standard CCHT 32 kb/s ADPCM coder and studied its performance 

A preliminary version of the LSP, termed as the Compromised Overlapping 
Neighbourhood (CON) - Local Approximation Technique was studied and reported 
by us in [90], [92], [93] In this approximation technique, the prediction is performed 
wifhm the analysis frame While this scheme gives better prediction properties 
compared to LSP it is not suitable for use in a speech coder for obvious reasons 

In another study due to Townshend [143], [144], a prediction scheme similar to 
LSP IS used on the LP residual in an ADPCM speech coder The local predictor, 
however, finds a local neighbourhood from a fixed analysis frame ot LP residuals 
corresponding to a relatively large frame length of 30 s of speech 

In a paper due to Wang et al, [119], a nonlinear predictor for speech based on 
vector quantization techniques is studied In this scheme, two vector codebooks 
are designed corresponding to voiced and unvoiced speech The training vectors 
used m the design of the codebooks are of the form given in eq (5 1) Two scalar 
codebooks are also designed corresponding to approximately optimal predictions of 
the codevectors of the respective codebooks A scalar prediction ot a target vector 
IS done by makmg a voiced/un voiced decision for the vector, finding the nearest 
codevector in the appropriate codebook and outputting the scalar value corresponding 
to that codevector index in the associated scalar codebook. This method can also be 
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regarded as a local prediction scheme Compared to LSI^ this prediction scheme uses 
a fixed analysis frame of very large length (corresponding to the training set size), 
a fixed neighbourhood parhtion and a zeroth order local predictor (corresponding 
to the sunple averaging of the future points of the traming vectors in each local 
neighbourhood) 

A "nonhnear oscillator model" for the reproduction of speech waveform has 
been reported very recently by Kubin and Kleijn [89] This model also essentially 
employs a local prediction scheme It uses an adaptive analysis frame However, its 
differences with LSP he mainly in the use of a more general reconstructed vector 
compared to eq (5 1), and in the use of a zeroth order local predictor 

Smger et al [129] have interpreted the problem of local state prediction as a 
codebook prediction problem In this interpretation, the reconstructed trajectory 
vectors within an analysis frame and their future points, together form a codebook 
pair Given a target vector whose future sample is to be predicted, the same steps 
as enunciated above for LSP are followed 

5.5 Structure and Performance of a Vector Excited 
Local State Prediction (VELSP) Coder 

A speech coding scheme which incorporates a local state predictor must be tailored 
to take advantage of the features of the predictor Two observations regarding a LSP 
are as follows (i) it simulates the function of both short and long term predictors 
to make predictions, and (ii) it is more suitable for use in a backward adaptive form 
where a predichon is made ahead of an analysis frame rather than within it as in 
forward adaptive prediction Based on the prediction performance of a LSP and the 
above observations, we study a basic speech coding scheme using LSP This coder 
is suitable for further development in the medium bit rate range of 48 - 16 kb/s 
range and can theorehcally provide low to medium coding delay The structure of 
the coder and the decoder are shown in figs 5 14 and 5 15 respectively This is an 
analysis - by - synthesis coding scheme and is similar to CELP in principle We 
tentatively name the coder as a Vector Excited Local State Prediction (VELSP) coder 
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Fig. 5.15: Basic structure of a Vector Excited Local State Prediction (VELSP) decoder 
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There are two frame lengths involved in this coding scheme One is the analysts 
frame which consists of previous reproduced speech based on which LSP is performed 
and the second is the excitation frame whose length denotes the excitation sequence 
length and the length for which prediction is performed based on one analysis frame 

The crucial difference of a VELSP from a CELP is in the use of a nonlinear 
predictor rather than a linear prediction filter Consequent to this, it is not possible to 
break up the excitation vector into a "gain" factor and a "shape" vector and compute 
the indices from the respective codebooks in the usual way One possible method of 
overcoming this difficulty is to use an excitation vector codebook which incorporates 
the gam factor However, given the dynamic range ot speech, a large size codebook 
would be required which makes an optimal excitation vector search impractical We 
attempt to overcome this problem by normalizing the reproduced speech used in 
the analysis frame, performing LSP based on the normalized frame and doing a gam 
multiplication of the "normalized" predicted frame in the end to get a frame length 
of reproduced speech Thus, the steps involved in getting an excitation frame length 
of reproduced speech are as follows 

(1) Normalize the analysis frame length of immediately previous reproduced speech, 

(2) Corresponding to an index i of the excitation codebook, do the following 

(i) Get one excitation frame length of "normalized" reproduced speech, 

(ii) Get the corresponding ophmal gain factor by minimizing the m s e between 
this excitation frame length of "normalized" reproduced speech and the 
original speech The reproduced, speech is given by the gain multiplied by the 
"normalized" reproduced speech 

(3) Repeat step (2) for all indices t of the excitation codebook The coder transmits 
that index and the corresponding quantized gam index ig which minimize 
the m s e between the original and reproduced speech of one excitation frame 
length 

In the following, we will discuss the analysis steps of a basic VELSP coder, the 
design of excitation vector and gam codebooks and the performance characteristics 
and scope for improvement of the coder 
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A. Analysis Steps of a VELSP Coder 

Let us consider the analysis steps required to obtain the parameters for transmission 
corresponding to one excitahon frame length N^. of input speech, a;„, n = 0, 1, ,Nf- 
1 We will need (i) an "analysis frame" of reproduced speech, n = -Nj - 
d + l, ,-l, where d is the embeddmg dimension and Nf is the analysts frame 
length corresponding to the length of the reconstructed trajectory in d- dimensional 
space, (ii) an "excitation codebook" of size Nc consisting of excitation sequences 
n = 0, ,iVe - 1, 1 = 1, ,iVc, and (iii) a scalar "gam codebook" of size Ng, 
consisting of gam values i = 1, ,Ng We will discuss about the design of 
the two codebooks after an understanding of their functions Given the above, the 
analysis steps are as follows 


STEP 1: Obtain the normalized trajectory of reproduced speech n = 

-Nf, ,-l. 

where 

, rp 

Yn ~ [yn-d^lijn-d-f-Z yn-l^n] y 

(5 9) 

Vn -- 

= 7 ?^, n = -Nf-d + l, ,-l 

^rrns 

(5 10) 

and 




(5 11) 

STEP 2: Corresponding to 

an excitahon codebook sequence index t, 


(i) Obtain "normalized" reproduced speech y„, n = 0, 1, , ATe - 1 according to the 

following 

= g(yti) 

(5 12a) 


yn = h^yl, n = 0, 1, 

(5 12b) 

where, 

T— 1 

o 

o. 

II 

(5 13a) 


T 

g(yn-l) = [yn-dl-iyn-d+2 yn-iyn] 

(5 13b) 

and 

! 

11 

(5 13c) 
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Here, is a local state predictor whose torm is chosen to be a linear model 

plus a constant term, eq (5 4), in the coder studied by us 

(u) Get the corresponding optimal gam tactor 


y-V, 1 -2 

Z^n t) 


(5 14) 


The above equation results by minimizing the m s e E with respect to g, where 


V, I 


E 


TV S ^ 


(5 15) 


ri 0 


Also, 


X = gijn^ ri - 0, 1, , N, - 1 


(5 16) 


IS the reproduced speech corresponding to the excitation index i and the quantized 
gam factor g 

STEP 3: Repeat step 2 for excitation indices i — 1, ,Nr Choose the index if and 
the corresponding quantized gain index ig which minimizes 


Er 


N, 1 

^ E 


N. 


'^nY 


71 0 


(5 17) 


The two indices and ig are transmitted per excitation frame to the decoder The 
decoder also has the analysis frame of reproduced speech r„, n = -Nf - d+l, ,-l 
It executes steps l(i) and eq (5 16) to get the reproduced speech, 2„, n = 0, 1, , iVg-l 

The above analysis shows that one needs to store input speech of length Ne 
before any processing for the frame can begin The theoretical minimum coding 
delay will therefore be given by SATf samples or .s where / is the sampling 
frequency of speech 

B. Design of Excitation and Gain Codebooks 

We give herein an approximate procedure for the design of the excitation and gain 
codebooks required by the VELSP coder and decoder Given a sequence of speech 
data, the corresponding vector and scalar training sequences are tound from the 
following steps 


J 
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STEP 1: For one frame of speech data, Xn, n = -Ny -d-hl, , -l, 0, 1, , AT, - 1, 

(i) Get a "normalized" frame i/„, n = -Nj -d + 1, where 

yn = (5 18a) 

^rms 

and 

j iv.-i ■ 

Nf+d + Ne-1 (5 18f>) 

J n=-Al^-d+l 

(u) Predict yo usmg LSP on the analysis frame y„, n = -iV/-d+l, ,-l using the 

procedure given above 

STEP 2: For n = 1, , A/e - 1, do the following 

(i) Shift the analysis frame by one sample to the future, 

(u) Predict y„ 

This gives one training sequence y„, n = 0, 1, , A/g - 1 for the excitation code- 

book. 

STEP 3: A training point y for the gam codebook is obtamed using eq (5 14) which 
results from the minimization of the m s.e. between Xn and y„, n = 0, 1, , A^e - 1 

From successive frames of speech, one can get the training sets using steps 1-3 
for the design of the two codebooks The gam and excitation codebooks can then 
be designed using the Lloyd and the generalized Lloyd algorithms respectively [52] 
C. Performance of a Fully Quantized VELSP Coder 

We have implemented and done preliminary performance study of a basic VELSP 
coder at three operation rates of 8 0 kb/s, 6 5 kb/s and 52 kb/s. The coder parameters 
used for each bit rate of operation are shown in Table 5 2 It is seen that the 
theoretical minimum coding delay at 5 2, 6 5 and 8 0 kb/s operahng rates is equal to 
7 5 ms, 6 0 ms and 4 875 ms respectively 

Codebook Design: The excitation and gain codebooks were designed using the steps 
given above for each bit rate of operation The two training sets were obtained from 
the 4 phoneme - specific sentence utterances of speech database 2 (Appendix B) by 3 
male and 3 female speakers The total duration of speech signal used corresponding 
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Bit Rate, kb/s 

5.2 

6.5 

8.0 

Analysis frame length, Nj 

160 

160 

169 

Embedding dimension, d 

10 

10 

10 

Local neighbourhood size, Ni 

40 

40 

40 

Excitation frame length, iV, 

20 

16 

13 

Excitation codebook size, Nc 

256 

256 

256 

Gam codebook size, Ng 

32 

32 

32 


Table 6.2: VELSP coder parameters at three bit rates ot operation 


to these sentence utterances is 50 .9 which provides lonooo samples at 8 kb/s sampling 
rate The training ratios for the excitation codebook at the 3 bit rates of 5 2, 6 5 and 
8 0 kb/s are 76 I, 95 0 and 116 8 respectively The training ratios tor the corresponding 
gain codebooks are 609 1, 760 3 and 934 8 respectively The initial codebook for the 
excitation sequences was designed using the method of pruning [52] Subsequently 
iterative refinement was performed using the generalized Lloyd algorithm The gam 

codebook was designed from its training set using the Lloyd algorithm tor empirical 
data 

Performance Study: The performance of the coder was studied using the 4 sentences 
(a)-{d) of database 2 (Appendix B) spoken by one male and one female This test 
data IS different from that used for the design of the codebooks The segmental 
SNR performance at the 3 bit rates is shown in fig 5 16 tor the 4 phoneme — specific 
sentences (a)— (d) and the overall database (dashed line) The segment length used tor 
the computation of the segmental SNR is 160 samples for 5 2 and 6 5 kb/s rates and 
169 samples for the 8 0 kb/s rate As expected, this measure of objective performance 
degrades with the lowering of the bit rate of operation 

The reproduced speech has perceptible noise and becomes poor at the 5 2 kb/s 
rate It must be mentioned that this performance is for the skeletal structure of the 
VELSP coder/decoder in which we have not incorporated any standard performance 
enhancing schemes. 
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Rate (Kbits/s) 


Fig. 5.16: Segmental SNR performance of a VELSP coding scheme at three bit rates of 5 2, 
6 5 and 8 0 Kbits/s. Graphs (a)-(d) are based on the phoneme specific sentences 
(a)-(d) respectively (speech database 2, Appendix B) spoken by one male and 
one female The dashed Ime gives the performances for the overall database 
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We also compared the objective and subjective performance of the fully quantized 
VELSP coder with unquantized forward adaphve and backward adaptive CELP coders 
without perceptual weighting The forward adaptive CELP coding scheme that was 
implemented is discussed in chapter 4 The backward adaptive CELP that was 
implemented is a block adaptive scheme in which the short term LP parameters for 
an excitation were computed from an analysis frame length of immediately previous 
reproduced speech values The parameters of the CELP coders were chosen such 
that when quantized, they would operate at approximately 5 2, 6 5 and 8 0 kb/s The 
following qualitative statements can be made about the comparisons 

(1) The segmental SNR performance of unquantized forward adaptive CELP coders 
IS 1 3 to 1 4 dB better than quantized VELSP coders at the three rates Similarly, 
the segmental SNR of unquantized backward adaptive CELP coders is 1 0 to 1 1 
dB more than quantized VELSP coders at the three rates Kroon and Atal have 
noted that the segmental SNR of unquantized CELP is approximately 2 dB more 
than that of a fully quantized one [84] 

(2) The subjective performance of a quantized VELSP coder is somewhat better than 
a backward adaptive CELP coder 

Directions for Further Study: The above preliminary investigation of a VELSP coding 
scheme suggests that it can be the subject of a more detailed investigation There 
are two major directions in which efforts are required These are 

(1) Improvement in subjective quality performance Instead of a simple m s e mini- 
mization, a perceptually weighted minimization criterion, as in CELP, can be incor- 
porated Further effort is required to reduce the perceptible noise in the reproduced 
speech by doing a detailed study of the characteristics of the noise and incorporating, 
for example, an adaptive postfiltering scheme 

(2) Reduction in computational complexity of the coder and decoder The major 
sources of computational complexity for one local state prediction are as given section 
5 3. For a VELSP coding scheme, in order to derive a gam and an excitation index 
corresponding to Ng speech samples, a full complexity coder will need to perform 
Ng X Nc local state predictions where Ng is the size of the excitation codebook 



5 Local State Prediction Coding of Speech 


179 


Likewise the decoder has to perform Ne local state predictions in order to get 
the correspondmg reproduced speech samples. While there are some obvious 
methods to reduce the computational complexity margmally, a detailed effort is 
required to investigate the possibility of bringing it down significantly 

In conclusion, we state that the above paradigm for low delay speech coding 
can be gainfully used when positive outcomes are obtamed with regard to the above 
two directions of further mvestigation 



Chapter 6 

The Rate Distortion Function and 
Computation of a Lower Bound 


Rate distortion theory is the discipline that deals with issues of signal compression 
from an information theoretic viewpoint The basic problem here can be stated as 
follows given a source distnbution and a distortion measure, what is the minimum 
rate at which mformation about the source can be reproduced with a specified 
expected distortion'? The foundation of rate distortion theory was laid by Shannon 
in 1959 in his paper, "Coding theorems for a discrete source with a fidelity criterion," 
where he defined the rate distortion funchon of an mformation source [127]. Prior 
to this, he had already introduced the concept of rate distortion in his celebrated 
paper, "A mathemahcal theory of communication," in 1948 [126] Specifically, the rate 
distorhon function, R{D), gives the minimum rate R at which information about a 
source can be transmitted subject to the constraint that it can be reproduced with 
an average distorhon D. The rate at which a source produces information subject to 
the requirement of lossless or perfect reproduction is the entropy of the source Rate 
distorhon function can therefore be looked upon as a generalization of the concept 
of entropy It may be worthwhile to keep in mind the basic communicahon system 
block diagram of fig 11 for the deliberations in this chapter 

According to the information transmission theorem, which is a generalization 
of the channel coding theorem, it is impossible to obtain an average distortion D 
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or less unless R{D) < C, where C is the capacity ot the transmission channel For 
memoryless sources, one can usually compute R{D) analytically if the source p d f 
IS known It can also be computed numerically from source output realizations 
usmg Anmoto and Blahut's algorithm [16], [17] For sources with memory, one can 
capitalize on the inherent statistical dependencies to further reduce the minimum 
rate needed to achieve a specified average distortion Even tor source outputs 
described by independent random variables, joint descriptions are more efficient 
compared to individual descriptions This is because rectangular grid points which 
arise naturally in independent descriptions do not fill the space as efficiently as lattice 
grid pomts m n-space Thus, it is meaningful to compute the rate distortion function 
from joint distributions of the source output process However, this is a difficult 
task Analytically, R{D) is known only for a few joint pdf's such as the jomt 
Gaussian density function and for specific distortion criteria [12] One can attempt 
to compute it from source output realizations using Anmoto and Blahut's algorithm 
but the computational effort and data length requirement increase exponentially as 
one considers statistical dependencies extending to successively larger time frames 

In this chapter, we consider the computation of a lower bound R^{D) of R{D) 
for stationary ergodic sources with memory Specifically, R^'{D) — R\{D) - A, where 
A = H{X) - H for discrete alphabet sources and A = h{X) - h for continuous 
alphabet sources [12], [151] Here, R\{D) is the first order rate distortion function 
and H{X) {h(X)} is the entropy {differential entropy} corresponding to the marginal 
probability (marginal probability density} function of the source output and { /i} is 
the entropy rate (differential entropy rate} with respect to the joint probability (joint 
probability density} function of the source output process R\{D) can be computed 
for both discrete and continuous alphabet sources from source output realizations 
usmg Anmoto and Blahut's algorithm Also, H{X) {h(X)} can be estimated using 
histogram techniques However, a direct computation of the entropy rate (differential 
entropy rate} based on an estimate of the joint probability (joint probability density} 
function IS prohibitively expensive on the data length requirement We will give 
algorithms for eshmating order - q entropy rates and differential entropy rates In 
this ordering of rates, H corresponds to the first order entropy rate and h is the first 
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order differential entropy rate respectively. We will be concerned with the estimation 
of lower bounds H 2 of H and /12 of h from source output reahzations which will 
then be used in the eshmation of R^{D) 

The organization of the chapter is as follows: In section 6 1, we give definitions 
and relevant lower and upper bounds of the rate distortion function for sources 
with memory. Both discrete and continuous amplitude sources are considered 
An algonthm to compute the first order rate distortion function, Ri{D), from source 
output realizations is given m section 6 2 Section 6 3 is concerned with the definitions 
of the order - q entropy rates and differential entropy rates and their estimation 
procedures usmg the generalized correlation sum Special attention is given to the 
estimation of the second order entropy rate H 2 and the second order differential 
entropy rate /i 2 - In section 6 4, we give examples of the estimation of H 2 and /i 2 
from random process realizations. Section 6 5 gives results of the computation of 
the lower bound R^(D) = Ri{D) + H 2 — H{X) of the rate distortion function R(D) 
for quantized speech sources 

6.1 The Rate Distortion Function: 

Definitions and Bounds 

Let us begin with a characterization of the source-user pair and a descnption of the 
notations used Consider an information 'source' S and a 'user' U For our purposes 
we give their characterization by the following 

(a) a source output alphabet set X and a reproduction alphabet set 3^ We will 
consider two types of sources ( 1 ) a discrete alphabet source for which X consists of 
a countable number of elements and ( 11 ) a continuous alphabet source for which X 
IS the real line 

(b) a 'distortion function' d X x y R^ from the source alphabet - reproduction 
alphabet pairs into the set of nonnegative real numbers 

(c) a probability law governing the source output It is assumed that the source 
output IS an infinite sequence of random variables {Xj}, -00 < t < 00 , e X which 
occur at the rate of /, per second We assume the source output is stationary 
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and ergodic and is governed by a set of consistent probability {probability density} 
functions p{xi, X 2 , ,x„), 1 < n < oo, x. € .V Although we have used the same 

notation for the joint probability function of a discrete alphabet source and the joint 
probability density function of a continuous alphabet source, the usage should be 
clear from the context For continuous alphabet sources we restrict to the case of 
absolutely contmuous joint probability distributions 

A sequence of random variables Xt, i will be represented as a 

random vector Xf 

= i]^ (61) 

Likewise a sample realization of Xf will be represented by xf 

xf = [t, x.n x,,d i]^ (6 2) 

Similarly, a d- dimensional reproduced random vector will be denoted by Yf and its 
sample realization by yf 

It must be remembered that signal compression is a deterministic process Its 
function IS to replace a sequence of source output symbols with a sequence of 
symbols from the reproducing alphabet in such a way that the new sequence has 
less entropy but at the cost of some distortion The same block of source symbol 
sequence will always produce the same block of reproducing symbol sequence But 
when one restricts attention to a single source output symbol (or a relatively small 
subblock) without knowledge of the previous or subsequent source output symbols 
or of that symbol's position within a block, then the reproducing symbol yk into 
which a source output symbol Xj is encoded can be thought of as a random variable 
even though at the block level the encoding is deterministic 

For a source output sequence of length n, the mutual information (defined in 
eq (6 10)) between the source sequence and the reproduced sequence follows the 
relation 

J(X^ Y^) = H{Y^) - H{Y^/X^) (6 3) 

where if(Y") is the entropy of the reproduced sequence of length n and //(Y™/X") 
IS the conditional entropy of Y" subject to X'“ Since signal compression is a 
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deterministic process, ir(Y"/X") = 0 Intuitively, we want to reduce the entropy 
H{Y^) of the reproduced symbol sequence subject to the condition on the allowable 
average distortion Thus, we would like to reduce the mutual information I{X^, Y") 
subject to this constramt. 

The general rate distortion function, R{D), with respect to the distortion function 
dn, IS defined by [12], [49] 

R{D) = lim Rr,{D) (6 4) 

TI-+00 

where Rn{D) is the order rate distortion function 


Rn{D) = 


- inf 
n p(y’'/x’')ePD 


J(X",Y") 


(6 5) 


Pp IS the set of conditional joint probability {joint probability density} functions 
p{y'^/yp) such that the expected distortion is < D. Thus, for discrete alphabet 
sources 

Pd = |p(y"/x") y”) < ^ I (6 6a) 

I X" y" J 

and for continuous alphabet sources, 

Pd = |p(y7ic”) £ J p(x’‘)p(yVx*) y^dy'dx" < d | (6 6i>) 

where is a distorhon critenon between sequences x" and y" It is called 

a single letter distortion criterion if 


dn(x”,y’*) = ^ y«) 

Z=l 


For example, the Hamming distortion is given by 


d{x,y) 


0 if X = y 

1 if X ^ y 


and the squared error distorhon is given by 


(6 7) 


(6 8) 


d(i, y) = (x-yf 


(6 0 ) 
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For discrete alphabet source-user pairs 


7(x'",Y^) = p(x'‘)p(y"A") log 

yn 


p(yVx-) 

Ex-'P(^”)p(y"/^'') 


(6 lOa) 


and for continuous alphabet source-user pairs 

r r v(v^^/x.^) 

= U. (6 106) 

R{D) IS a continuous, monotonic decreasing, convex function m the interval 
from D = 0 to D = where Dmax is given by R{D) = 0 for D > For 

each D G there is one and only one relative minimum of /(X",Y") in Pq 

This minimum value is R{D) and it always occurs at a point having an average 
distortion equal to D The inequality part of eqs (6 6a) and (6 6b) are operative only 
for D ^ D^(xx 

The analytical computation of R{D) is largely an intractable problem Therefore, 
one resorts to computing its bounds An obvious upper bound which is usually 
computed is Ri{D) corresponding to the marginal probability {probability density} 
function of the source alphabet For continuous alphabet sources, a tighter upper 
bound to R{D) based on the mean square error criterion is provided by the rate 
distortion function of a Gaussian source having the same first and second order 
moments [12] 

We are here concerned with the computation of a lower bound R^{D) of R{D) 
[12], [151] For a discrete alphabet source. 


R^{D) = Rx{D) + H - H{X) 


(6 11 ) 


where H{X) is the entropy of the source based on the marginal probability 

H{X) = — ^ p{x) logp(x) (6 12) 

X 


H = lim 

n-^'Oo Tl 

= - hm iVp(x")logp(x'‘) 

n-+oo Tl ' 


and H is the entropy rate 


(6 13rt) 
(6 136) 
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Similarly for a continuous alphabet source 


R^iD) = Ri{D) +h - h(X) 

(6 14) 

where h{X) is the differential entropy of the source based on 

the marginal density 

function- 

h{X) = - J p{x) logp(cc) dx 

(6.16) 

and h is the differenhal entropy rate 

h = hm -h{X^) 

n— voo Tl 

(6 16a) 

= - lim i / p(x")logp(x”)dx" 
rt— oo n 

(6 166) 


Successively tighter lower bounds of R{D) are obtained from Rn{D), the n^'*- order 
rate distortion function for increasing values of n However, we restrict attention 
to the computahon of R^[D), eqs (6 11), (6 14) The first order rate distortion 
function can be computed usmg Arimoto and Blahut's algorithm [16], [17] The 
entropy {differential entropy} can be computed by first estimating the probability 
{probability density} function using standard histogram technique Therefore we 
restrict attention to the estimation of entropy rate H for discrete alphabet sources and 
the differential entropy rate h for continuous alphabet sources Since we estimate 
the second order entropy rate and the second order differential entropy rate ho 
which are lower bounds to H and h respectively, we actually compute the following 
lower bound to R{D) 

R^{D) = Ri{D) + H2- H{X) (6 17a) 

for discrete alphabet sources and 

R^{D) = Ri{D) -f- /i2 - h{X) (6 176) 

for continuous alphabet sources 

In the following section, we review Arimoto and Blahut's algorithm for the 
numerical computation of the first order rate distortion function from source output 
realizations 
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6.2 An Algorithm for the Computation of the 
First Order Rate Distortion Function 

In this review of Anmoto and Blahut's algorithm for the computation of Ri{D) from 
source output realizations, we follow [16], [17] It is possible to use this algorithm 
to compute higher order rate distortion functions but the computational effort and 
data length requirement increase exponentially with the order We will restrict the 
following discussion to the computation of jR^,{D) for n = 1 for discrete alphabet 
sources only It can be appropriately extended tor continuous alphabet sources also 
Smce Ri{D) makes use of the marginal probability function only, we will do away 
with the notation of the explicit dependence of the various functions on n For 
example, p(x'‘) will be replaced by p{x), p{y'Vx'‘) t>y iKy/x), Ri{D) by R{D) etc 
compared to the definitions in section 6 1 The algorithm tollows from a set of 
theorems due to Blahut [16] which we present, without proof, as Facts below 

Fact 6.1: R{D) is a decreasing, convex, and hence continuous function defined in 
the interval 0 < < Dmox/ where 

D^ax = min'^p(x)d(E, y) (6 18) 

p(y) ^ 

In this interval, the inequality constraint on D is satisfied with equality 1 

D-max corresponds to the smallest average distortion that can be obtained with 
zero information transmission This is the maximum distortion that needs to be 
tolerated by a source Fact 6 1 tells us that the inequality constraint in the definition 
of R{D) can be replaced with an equality constraint in 0 < D < This allows 

us to accommodate the constraint with a Lagrange multiplier s and express R{D) as 

R{D,) = 

rma 
pivl^) 

(6 19) 




where now p{y/x) is unconstrained except that it must be a valid conditional prob- 
ability function For each choice of the Lagrange multiplier s, the minimum will be 
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achieved by some condihonal probabihty hinchon p*{y/x) In this way, the Lagrange 
multiplier s becomes the independent parameter Next, the mmimization problem 
IS enlarged into a double mimmizahon problem to facilitate the development of the 
algorithm 

Fact 6.2: The rate distorhon function R{D) can be expressed as a double minimum 


R(D) = sD + nun min 
p(y) p(y/^) 


- -r y 


p{y/x) 

p{y) 


s Y^p{x)p{y/x)d{x, y) 

X y 

(6 20 ) 


where 

^ y) (6 21 ) 

X y 

and p*{y/x) achieves the minimum For fixed p{y/x), the right side is minimized by 


p{y) = 

X 


For fixed p(y), the nght side is minimized by 


p(y/x) 


p{y)exp{sd{x,y)) 
EyP(y) exp(sd(a;, y)) 


(6 22 ) 


(6 23 ) 


The simultaneous condihons form the basis of the algorithm given m Fact 6 4 
Fact 6.3: The rate distortion function can be expressed m the form 


R{D) = max min 

3el-oo,0| p(y) 


sD- ^p(®)log^p(y)exp(sd(a;,y)) 
•c y 


(6 24 ) 


This follows as a corollary from Fact 6 2 By varying parameter s from -oo to 0, one 
can get the entire R{D) curve for 0 < D < Umax 


Fact 6.4: Let the parameter s < 0 be given and let A{x,y) = exp{sd{x,rj)) Let p°(y) 
be any inihal probability function such that all components are nonzero Also, let 
P’^^'(y) be given in terms of p''(y) by 


/*'(!/)=?' fa) E 


p{x)A{x,y) 

J2yP^{y)M-^^y) 


(6 25 ) 


X 
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Then, 


D{p^{y/x)) — ^ r oo 


(6 26) 


and 


I{p{x),p"{y/x)) R{D,) as r-^oo 


(6 27) 


where 


p'{y/^) = 


p’'{y)A{x,y) 


(6 28) 


Eyfiy)A{x,y) 

and {D„R{Ds)) is a point on the R{D) curve parametrized by ? ■ 

In eq (6 26) we have denoted the explicit dependence of D at the iteration on 
p'-(y/s) In eq (6 27), I{X,Y) has been denoted by I {jp{x) , {y I x)) to highlight the 
dependence on p{x) and p''{y/x) The application of Fact 6 4 to the computation of 
the rate distortion function is illustrated in fig 6 1 The termination of the algorithm 
IS based on Fact 6 5 given below It gives upper and lower bounds on R{D) and is 
helpful in estimating the residual error after each iteration 

Fact 6.5: Let the parameter 5 < 0 be given, and let A{x,y) = oxp(sfi(x, y)) Suppose 
p{y) is any reproduced alphabet probability function, and let 

A(x, y) 


cW=EpWvp(v)A(.,,) 


Then, at the pomt 


V „ i:yP{y)A{x.y) 


(6 29) 


(6 30) 


we have 


R{D) <sD- 'Y^p{x)\ogY^p{y)A{x,y) - Y^p{y)c{k) logc(A) (6 31) 

X y y 

and 

R{D) >sD-Y^ p{x) log p{y)A{x, y) - max log < (y) (6 32) 

X y 

m 

This completes the development of the algorithm steps We have used this 
algorithm to compute Ri{D) which is required in the expression of W'{D), eq (6 17) 
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A{x, y) = exp(5(i(x, y)) 


( \ ^ A{x,y) 

p{y) ^ p(2/)c(y) 

Tu = ^p{y) logc(?/) 


Tl = maxlogc(y) 


Tu — Tl < e 


p{ylx) = 


A(x, y)p{y) 
'EyA{x,y)p{y) 


D = Y2'^p{x)p{y/x)d{x, y) 

X y 

R{D) = sD - Y^p{x)p{y/x)d{x,y) 

X 

R{D) ^sd-y, p(®) log Y, ~ Y p^y^ c{y) 


HALT 


Fig. 6.1: Anmoto and Blahut's algorithm for the computation of first order rate distortion 


function 
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6.3 Computation of Lower Bounds of the 
Entropy Rate and Differential Entropy Rate 
using the Correlation Sum 

A direct method tor the numerical estimation ot entropy and ditterential entropy 
rates is to first estimate the joint probability or probability density functions used in 
their expression Howeven a major disadv*intti^je ot this method is the recjuirement of 
exponentially increasing data length .V as higher ottler joint probability {probability 
density} functions are considered One way ot leduung the effect ot this limitation is 
to estimate the entropy and differential entropy rates tiom the generalized correlation 
sum The method of correlation sum has been stiuiied and used in the estimation 
of dimension and metric entropy ot nonlinear dynamical systems (see eg [50], [112], 
[57], [58], [110], [31], also chapter 3} We will use an analogous scheme to estimate a 
lower bound of the entropy rate and exteiui it to estimate a lower bound /12 of 
the differential entropy rate 

We first consider the estimation ot order - 1 / entiopy rates ot a < fuhmuous alphabet 
stationary ergodic source with memory in the following subsection 1 hereafter, we 
discuss the estimation of the second order entropy rate H 2 ^ discrete alphabet 
source and the second order differential entropy rate h> tor a continuous alphabet 
source m succeeding subsections Ideally, // x lor a continuous alphabet source 
If the divergence of the entropy rate is avoided by considering a finite partition, the 
result IS a reflection of the choice of partition resolution I he necessity for considering 
a partihon resolution r away from 0 is due to the finite resolution and limited length 
of the time-series data available in a numerical or experimental situation 

6.3.1 Generalized Entropy Rates and their Estimation 
using the Generalized Correlation Sum 

Consider a discrete time information source S \chose outpuit sec|uence is governed 
by a probability law as given in section 6 1 I et V be the real line or a continuous 
subset of it To define the entropy rate H, partition the support .V" ot X" into 
R(r,n) = Ki,7r2, ,7rA/(r,„)] of Side Tcsolution r Here A[(i.ii} denotes the number of 
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partihons of the support of X" Let be the normalized probability measure of 
the partihon ttj Then 

Af(r,n) 

ir(X",r) = - ^ Prolog n = l,2, (6 33) 

j=i 

IS the entropy based on the n-dimensional joint density function at resoluhon r, 

H{r) = hm - i?(X", r) (6 34) 

n— >oo 71 

IS the entropy rate at partition resoluhon r and 


H = \im H{r) (6 35) 

IS the entropy rate of source S 

In order to consider the estimahon of a lower bound of H, we consider the 
more general "order-g Renyi entropy rate" or "generalized entropy rate", Hq [112], 
[116] The order-g entropy of dimension n at resolution r is given by 

M{r,n) 

J,(X",r) = 9 5^ 1 (6 36a) 

^ 1=1 

M(r,n) 

ifi(X",r)=F(X",r) = - ^ P.,logP,. (6 365) 

1=1 

Then, 

P,(r)= hm ip,(X",r) (6 37) 

n— >oo 71 

IS the generalized entropy rate at resolution r and 

Hq=limHq{r) (6 38) 

IS the generalized entropy rate 

Let US now consider the estimation of the generalized entropy rate from the 
generalized correlation sum There is an important approximation that relates the 
generalized correlahon sum to ^*1 (^ ^bove Let us consider this 

approximation and the subsequent relation of Hq to the generalized correlation sum 
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Let 5x^(r,n) represent a ball ot radius r around the point in .V", the support of 
X" Also, let Its probability be given by 

Bx-(r,n) = (m) (6 39) 


Given a sample realization xi,X2, ,x,v' of the source output, construct vectors 


x” = [X: X.4.1 Xttn-l]^. 1 = 1,2, ,N'-n+l 

For a point x", an estimate of Sx;(r,n) may be obtained from 



1 


N-1 




(6 40) 


(6 41) 


where N = iV' - n + 1, x;‘ is given by eq (6 40) and B(arg) = 1 tor arg > 0 and 
0(arg) = 0 otherwise The choice of the metric determines, for example, whether 
Bx»(r, n) is a n-dimensional sphere (L> norm) or a cube norm) etc However, 
Hg itself IS invariant to the choice of the metric 

Now let be the estimation of the probability of that partition ttj which 
contains the point x" If this partition contains points, then the important 

approximation is 


j5v» 


N, 


’^' 0 ) 


N 




(6 42) 


The idea behind this approximation is that most points m -n-jQ) will be within r of x" 
and although some points that are further away will be ignored, others that are close 
enough but in neighbouring partitions will be counted. The error in approximation 
IS only a factor of order unity [50] 


The estimated generalized correlation sum is given by 


1 ^ 

Cg{r,n,N) = X! [%('*’ ^) 
Cl (r, n, N) = lim Cg{r, n, N) 


q-l 


q 7^ 1 


(6 43a) 
(6 436) 


The following derivation shows the approximation between the generalized corre- 
lation sum and the probability of partitions It proceeds by substituting fo*" 



6 The Rate Distortion Function and Computation of a Lower Bound 


195 


Bx”(r,n) and replacing the sum over points t = 1,2, ,N with a sum over the 
parhhons and a sum over the x^, j = 1,2, ,N m each partition 

19-1 

C,{r,n,N) = - Y,[Bx^{r,n) 
j=i 

Af(r,n) N 

-nT, E [^..1’-' 

i=i j=i 


M(r,n) 

l=l 


M{r,n) 


= E 

i=i 

(6 44) 

Here Is (s) = 1 if s € 5 and 0 otherwise Comparmg eqs 
we have an estimate of the generahzed entropy rate 

(6 36)-(6 38) and eq (6 44) 

Hq{X^, r) = log Cg{r, n, N), 

q -- i 

qy^l (6 45a) 

II 

(6 455) 

Hg(r)= lim -ffg(X",r) 

n— ^oo Ti 

(6 46) 

Hg=\im Hg{r) 

^ r^Q ^ 

(6 47) 


Note that we have not mentioned the explicit dependence of the estimates on the 
data length N Alternatively, 


Hg = liin lim 

r— >0 n-^oo 


Hi = lim Hg 

g->l+ ^ 


1 


C,{r,n,N) 
q-l"° Cg{r,n + 1,N) 


log-^ 


97^1 


(6 48a) 
(6 486) 


The above equation can be used to estimate the entropy rate H by approaching the 
limit g 1+ However, we will estimate a particularly easy and special case of the 
order-g entropy rate, i e , for q = 2 The second order entropy rate Hy is a lower 
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bound of the entropy rate H,ie , H> > H For this case, the correlation sum is given 
by 


C(r,n,iY) = C2(r, n,jV) 


N \ 


w(w - 1) ^ 



(6 49 ) 


where 

Also, 


are given by eq (6 40) and C{r,7i,N) is the estimated correlation sum 


r .On .-v (y(r^ ri j- 1. N) 


(6 50 ) 


Some remarks on the above procedure are as tollovvs 


(1) Conjectured advantages of the correlation sum technique over the histogram 
technique in estimating the entropy rate: I he correlation sum ('{r,ri,N) computes 
the fraction of the total number ot pair of points of the torm x", i = 1,2, ,iV for 
specified dimension n, that are less than or equal to a distance r of each other The 
conjectured advantages of this estimation procedure are 

(a) Using the correlation sum technique one has access to N~ pair ot points as 
opposed to N points in the histogram technique Thus, in the former one can 
probe down to distance scales of 0{N “/”) i e., upto the smallest mterpoint distance 
whereas in the latter one can probe only upto 0(N for data length N and vector 
dimension n This is advantageous since the entropy rate has to be evaluated at 
small partition resolution r. 

(b) Consider the estimation procedure using both schemes at a fixed resolution r 
Using the histogram technique, estimates of P„, for the partition tt; that are outlying 
can be grossly inaccurate This is due to the finiteness of data length and can 
lead to an underestimate of H If we use the correlation sum technique, then the 
sum over the partitions is changed to a sum over points on the vector sequence 
x”, 1 = 1,2, ,N We find the fraction of the total number of pair of points which 
are within a distance r of each other Therefore, intuitively the inaccuracy due to 
finiteness of data will be less in this case 
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(2) Statistical properties of the correlation sum estimator: Throughout this work 
we are concerned with functions of the estimated correlation sum Therefore, it 
IS worthwhile to know the stahstical properties of the correlation sum estimator 
C(r, n, N) given by eq. (6 49) It eshmates 

C{r,n) = E[BAr,n)] (6 51) 


where 5x>>(r,n) is given by eq (6 39) More specifically, C{r,n) is the correlation 
function 

C(r,n)= [ c(A,n)dA (6 52a) 

J\\A\\<r 

c(A,n)= f / d/x(x")cl/x(x 2 )^”(x” — X 2 — A) (6 526) 

c{A,n) is the probability density for the difference between two random vectors x" 
and X 2 both chosen according to their probability measure. 

The statishcal properties of the correlation sum estimator have been investigated 
in the context of the estimation of dimensions and metric entropy in nonhnear 
dynamics. It is shown m [55] that if the sample data points {xi, X 2 , , xn} are 

independent, then the correlation sum estimator is unbiased. That is 


E 


C{r,n,N) \ = C(r,n) 


(6 53) 


Otherwise, if the data is autocorrelated, then the estimator is asymptotically unbiased 
However, the finite N bias can be corrected by a slight modification of the definition 
of the estimator [134] 

Cir, n. N) = f: e (r - (xf - ||x”|l) (6 54) 

where kF is a small posihve integer value Theder has shown that the correlation sum 
estimator is a consistent estimator of the correlation function [138] The standard 
deviation cr^ of the estimator generically scales as 0(1/ y/N) for iV oo Those 
sources for which the pointwise mass function, eq (6 39), remains constant, i e , 


(6 55) 


S?m5 = E[B - E{B)]‘ = 0 
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cTfj scales as 0(1 /N) for N oo 

(3) Implementation Aspects and Choice of Resolution Scale: Given a source output 
realization x., ^ = 1,2, ,N' the implementation consists of the following steps 

(a) Construct sample vectors x", z = 1,2, (N = N' - n -F 1) and for n = 1,2, 

usmg eq (6 40) 

(b) At fixed resolution r, obtain C{r,n,N) for increasing values of n using eq (6 49) 

(c) Plot Hzir, n) vs n where 


H 2 {r,n) = log 


C(r,n,N) 
C(r, n -1- 1, N) 


(6 56) 


The quantity H-iir.n) should converge with increasing values of n which is read off 
as the estimated value H 2 (r) of the second order entropy rate Hzir) 

Ideally, Hoir) is evaluated at r -+ 0 limit However, there are limitations in 
approaching this limit A lower limit to r is set by the accuracy of the data At 
distance scales of the order of the smallest interpoint distance, the correlation sum 
scales as a collection of isolated points Further, at small scales, the correlation sum 
may be inaccurate due to the presence of additive noise On the other hand, at large 
scales the correlation sum tends to saturate Thus, the distance scale r should be 
chosen between these extremes. Preferably, H>{r) should be computed for different 
values of r in the intermediate range In the examples of section 6 4 it is seen that the 
variation of the estimate is small over a large range of values of r For more details 
on the implementation aspects of the correlation sum algorithm, the reader can see 
section 3 7, where the discussion is in the context of the estimation of correlation 
dimension and second order metric entropy from time series 


6.3.2 Second Order Entropy Rate for a 
Discrete Alphabet Source 

Consider the discrete alphabet source S described in section 6 1 Let ■V = {rtl,a2, ' 
aj},\X\ = J. Let us now represent the source alphabet set by an integer valued 
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set Z = {i 1 < i < 1^1} such that a source output sequence {a;,, -oo < i < oo} is 
represented by {zt, — oo <t < oo} where = k if — ak, 1 < k < J Thus, 


p{xi = a^,X2 = aj, ,Xn = ak)=p{zi=t,Z2=j, ,Zn = k), 1 < n < oo, a:. € .T 

(6 57) 

This transformation of the alphabet set does not change the generalized entropy rate 
because it is coordmate invariant. The second order entropy rate is estimated using 


JT2(r) = lim log 


C{r,n,N) 
C{r, n + TN) 


H2 = Urn H2{r) 

r—^l 


(6 58) 
(6 59) 


Here C(r, n, N) is given by eq (6 49) In section 6 4, we give results of the computahon 
of H 2 for two specific discrete alphabet stochastic processes 


6.3.3 Second Order Differential Entropy Rate for 
Continuous Alphabet Source 

Analogous to the development in section 6 31, we now consider the estimation of 
a lower bound /12 of the differential entropy rate h, given by eq (6 16) We begin 
with what we call the "order-g differential entropy rate" hq Consider a continuous 
alphabet source S as described in section 6 1 The order-g differential entropy for a 
n-dimensional random vector X'^ = [Xi,X 2 , is given by 

^9(X”) = f [p(x")]^ dx”, q>0, I (6 60a) 

/m(X") = ^(X") = [ p(x") logp(x")dx" (6 605) 

Vx" 

provided the integrals exist The order-g differential entropy rate is then defined as 

hq = lim -/iq(X"), g > 0, g 7^ 1 (6 61a) 

hi = h = lim -h{X^) (6 616) 

n—^oo Tl 

Renyi has proved a limiting relation between the order-g entropy and the order-g 
differential entropy by considering the continuous distribution of the random vector 
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X"- to be a limiting case of the discrete distribution ot an associated random vector 
X;i (see [116], Appendix) 

Consider an n-dimensional random vector X" = [Xi,X>, with an ab- 

solutely continuous distribution Let p(xi,x>, ,x„) be the corresponding density 
function of X'* Let X" = [Xi^, Xor, , be an associated discrete random vector 
where X,r = r [^] , i = 1,2, ,n (Here [y] denotes the largest integer < y) If 
£r(X",r) IS finite for r = 1, then 

/i,(X") = Lim[F,(X;*) + nlogr], y > 0, (6 62a) 

where i?,(X", r) is given by eq (6 36a) and /!,(X”) is given by eq (6 60a) and 

hi (X") = hm [Hi (X" , r ) -f n log r] (6 62fe) 

r ♦() 

where Hi{X^,r) is given by eq (6 36b) and /ii(X") is given by eq (6 60b) 

This result can be extended to obtain a relation between the two entropy rates 


hq = lim[JT,(r) + log r ], y > 0, y 7 ^ 1 (6 63) 

where Hq{r) is the order-y entropy rate ot X'‘ at resolution scale r, eq (6 37) and hq 
IS the order-y differential entropy rate ot X", eq (6 61) 

To obtain an estimate hq of hq,q > 0 we compute the right side ot eq (6 63) 
Hq{r) IS estimated by Hq{r) using the generalized correlation sum at a distance scale 
r Thus, 

hq = \im{Hq{r) -h log r], y > 0 (6 64) 

We will estimate hq (ot q — 2 The second order differential entropy rate h> is a 
lower bound to the differential entropy rate hi The estimate h> is given by 


/i2 = lim 

r -~+0 


H>(r) H- log r 

Un, ,<,g fClr, n.N) 
^-*0 C{r,n + 1,N) 


(6 65) 


where C(r, n, N) is given by eq (6 49) 
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The choice of the resolution scale r and the implementahon steps discussed in 
section 6 31 carry over to the estimahon of h 2 - Step (c) and eq. (6 56) are modified 
to the followmg 


Obtain a plot of A 2 (r,n) vs n where 


h 2 {r, n) = log ^ 


rC{r, n, N) 


C{r,n + hN) 


(6 66 ) 


The quanhty h 2 {r,n) should converge to h 2 {r) with increasing values of n and for 
fixed r which is then read off as the estimated value Ao of the second order differential 
entropy rate 


6.4 Results of the Estimation of Second Order 
Entropy and Differential Entropy Rates from 
Time Series Realizations 


We consider two examples each of the estimation of the second order entropy rate 
H 2 and the second order differential entropy rate /i 2 using their respective estimates 
based on the correlation sum In each case the Euclidian distance is used as the 
metric in the estimation of the correlation sum in eq (6.49) 

1. Discrete Alphabet i.i.d. Random Process: Consider an 1 1 d random process 
{Xi} whose random variables are drawn from a discrete alphabet set X = {1,2, 3, 4} 
Let p{i) = i, 1 G X We have 


(a) Entropy rate H = 20 bits/sample 

(b) Estimated value H 2 {t) = 1 88 ± 0 03 bits/sample at = 2 0 over 8 realizations 

(c) Estimated value F 2 = 1 80 ± 0 10 bits/sample over 4 values of from 10 to 4 0 
at intervals of 1 0. 

2. Discrete Markov Cham: Let {X^} be an 8 state stationary Markov chain as shown 
in fig 6 2 Its probabihty transition matrix P has entries given by 

P^J =p{Xn+l=3lXn=l)=[y^^ if iGgJ = ±1, G (0,1, ,7} 


For a Markov chain 


H = hm i^CX") = hm H{XjX^-u 

n-ioo n Tl-^OO 


(6 68 ) 


Xi) = ff(X2/Xi) 
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6.2: An 8 state Markov chain for example 2, section 6 4 
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We have- 

(a) i? = 1 5 bits/sample. 

(b) H 2 {r) = 1 34 ±0 03 bits/sample at = 2 0 over 8 realizations 

(c) H 2 = 1 31 ± 0 08 bits/sample over 8 values of from 1 0 to 8 0 at intervals of 1 0 

Figure 6 3 shows the convergence of H 2 ir, n) with increasing dimension n for one 
realization each of examples 1 and 2 

3. Gaussian Distributed i.i.d. Random Process: Let {cji} be a white Gaussian noise 
process where en ~ A/’(0, 1) For such a process the differential entropy rate is given 
by h= ilog2(27recr^) (see eg [12], [32]) Thus, we have 

(a) Differential entropy rate h = 2 047 per sample 

(b) Estimated value h 2 {r) = 1 85 ±0 02 per sample at r = 03<re over 8 reahzations 

(c) Ao = 2 03 ± 0 17 per sample over 10 values of r from 0 lo-g to 1 Ocr^ at intervals of 

0 IcTg 

4. Jointly Gaussian Distributed Random Process: Let X, be a jointly Gaussian dis- 
tributed random process with n-dimensional random vectors X" = [Xi,X 2 , 
distributed as jV*a(m, where m is the mean vector and Kn is the n x n covariance 
matnx The differential entropy rate of the process is given by 

^ J log(27re) + lim ^ log |ii:„| (6 69) 

Z n— ►oo Zn 

where \Kn\ is the determmant of Kn [12], [32] 

Consider a zero mean Gaussian random process with 

( cr^, A: = 0 

J5'[XjXt+fc] = < aaj(, k = ±l (6 70) 

0, otherwise 

For cr^ = 2 0 and a = 0 5, we have 

(a) h = 2 047 per sample 

(b) h 2 {r) = 1 87 ± 0 02 per sample at r = 0 3<7e over 8 realizations 

(c) A 2 = 2 16 ± 0 21 per sample over 10 values of r from 0 lo-e to 1 at intervals of 

0 IcTg 
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Fig. 6.3: Graph of Hy{r,n) vs n, eq (6 56) The estimation is done for one realization 
each of examples 1 and 2, section 6 4 



Fig. 6.4: Plots of hi{r,n) vs n, eq (6 66) for 4 different distance scales, r = 0 

0 6<re and 0 8<7e for example 3, section 6 4 Very little variation in h-i is observed 
as a function of r 
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A data length N = 40000 was used for each realization in parts (b) and (c) m all 
the 4 examples above 

It must be emphasised that we have computed a lower bound to the entropy and 
differential entropy rates m the above examples. Nevertheless the estimated values 
are within 11% of the respective theoretical values of the rate. As the examples 
show, the variation of ff 2 and ho is small over a fairly large range of resolution scales 
r Figure 6.4 shows the variation of Ho{r,n) vs n for 4 different resolution scales 
r = 0 2cre, 0.4<re, 0 6<Te and 0 ScTg for example 3 Thus, while the choice of r may not 
be an important factor in the estimation procedure, we suggest that the estimate be 
computed for a range of resoluhon scales and reported as an average. 

6.5 Computation of a Lower bound of R(D) 
for Quantized Speech Source 

Vanous rate distortion functions for speech sources have been proposed previously 
These include the first order rate distorfaon function based on adhoc speech probabil- 
ity models such as the Gaussian and Laplacian density functions, the rate distortion 
function based on the assumption of a Gaussian AR source using the mean squared 
error distortion cntenon and those based on perceptually significant distortion mea- 
sures See eg [2], [122], [77] 

In this section, we give results of the computation of the lower bound R^{D) of 
the rate distortion function R{D) for quantized speech sources with respect to the 
mean squared error distortion criterion. (Note that in a stnct sense the algorithm is 
not applicable because speech is a time varying source) We have 

R^{D) = Ri{D) FH 2 - H{X) (6 17a) 

where Ri{D) is the first-order rate distortion function, eq (6 5), H{X) is the first 
order entropy, eq. (6 12), based on the marginal distribution of the quantized speech 
source and H 2 is the second order entropy rate given by eqs (6 36)-(6 38), and 
estimated from hme series data using eq (6 50) 

The speech database is as described in Appendix B (database 2) It consists of 4 
phoneme specific sentences spoken by 2 males and 2 females The speech signal was 
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prefiltered at 3 9 kHz and sampled at 8 kHz at a resolution ot 16 bit&/sample A total 
of 39 96s duration of speech data has been used tor this work 

The lower bound R^{D) of R{D) has been computed tor quantized speech at 3 
resolutions of 6, 8 and 10 bits/sample The first order rate distortion tunction, R^{D), 
has been computed using Arimoto and Blahut's algorithm Figure 6 5 shows graphs 
of the first order rate distortion function tor the 3 quantization levels with respect 
to the squared error distortion criterion The first order entropy, H{X), has been 
computed by approximating the probability tunction with histograms See Table 61 
for the results of the estimation ot H{X) and H> Figure 6 6 shows the convergence 
of Hiir.n) with increasing dimension n obtained tor two sentence utterances - one 
each by a male and female speaker Finally, R^‘[D) has been plotted as rate vs SNR 
m fig 6 7 for quantization resolutions ot 6, 8 and 10 bits/sample It is seen that the 
graphs for different speech source resolutions intersect each other This is because 
two contradicting factors determine the rate A higher source resolution implies that 
greater information needs to be transmitted However, higher quantization resolution 
also gives the scope for further removing the redundancy between the source output 
symbols thereby reducing the rate based on the given distortion criterion 

The above example illustrates the procedure tor computing lV'{D) However, 
to evaluate the performance ot a low bit rate speech coder with respect to a lower 
bound, it IS more appropriate to use a perceptually significant distortion criterion 


Source Resolution 

Entropy, H{X) 

Second Order Entropy 

(bits/sample) 

(Kbps) 

Rate, //-(^ 10’) 

6 

21 2 

1 34 

8 

35 35 

I 47 

10 

50 84 

220 


Table 6.1* (1) The estimated entropy using histogram technique, and (2) the estimated 
second order entropy rate, for quantized speecn sources at 3 resolution scales 
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SNR (dB) 

Fig. 6.5: Graph of Ri(D) with respect to the mean squared error distortion critenon for 
quantized speech sources usmg Blahut's algorithm Note the logarithmic scale of 
the x-axis the average distorhon is plotted as SNR (a) Quantization resoluhon 
= 6 bits/sample, (b) 8 bits/sample, and, (c) 10 bits/sample 



Fig 6.6: Plots of Hzir.n) vs n showing the convergence of the entropy rate with 
increasing dimension (a) Female speaker Sentence utterance Why were you 
away a year, Roy? Data length = 20480 samples at 8 KHz, Quantization 
- 8 bits/sample, (b) Male speaker Sentence utterance Nanny may know my 
meaning Data length = 14016 samples at 8 KHz, Quantization = 10 bits/sample 
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6.7: Graph of R^{D), eq. (6.17a) with respect to the mean squared error criterion 
for quantized speech sources. The average distortion D is plotted as SNR- (a) 
Quantization resolution = 6 bits/sample, (b) 8 bits/sample, and, (c) 10 bits/sample. 



Chapter 7 
Conclusion 


We have done a preliminary study of a nonlinear determmistic framework for 
speech signal modelling for coding applications Most medium and low bit rate 
speech coding schemes model the signal as an output of a time varying linear filter 
excited by a source. A large portion of the recent research effort has been directed 
to the design of appropnate excitation functions rather than mvestigate alternative 
model forms However, this research paradigm for speech codmg may now be 
approaching a stage of saturation A major purpose of our study is to replace the 
ubiquitous hnear filter by a more general nonlinear representational form Towards 
this goal, we first did a state space based analysis study of speech signals Just as a 
correlation analysis helps in a linear modelling exercise, the eshmahon of dynamical 
invariants such as dimension, metric entropy and Lyapunov exponents from time 
senes is helpful in building nonlinear deterministic state space models of the time 
senes We recapitulate the salient features of this analysis study and the results 
obtamed for speech signals in the following points: 

(1) The scalar hme senes is embedded in a reconstructed state space as a reconstructed 
trajectory This reconstruction is based on Takens' theorems which state that 
the dynamical invariants computed from the reconstructed trajectory will be the 
same as those of the original dynamical system whose time evolution is observed 
in the form of the scalar time series 
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(2) We have used two optimality catena, namely the singular value decomposition 
method and the redundancy method to reconstruct speech tra)ectories, plot their 
2-d projections and make observations trom them 

(3) Lyapunov exponents give a coordinate independent measure of the local stability 
properties of a state space trajectory They categorize bounded trajectories 
into equilibrium points, periodic solutions, quasiperiodic solutions and chaos 
We have computed the largest Lyapunov exponent ot reconstructed trajectories 
of phoneme articulations and have compared the results with those obtained 
from synthetically generated periodic and quasiperiodic data From the results 
obtained and the comparison tests, we conclude that reconstructed trajectories 
can be distinguished from periodic and quasiperiodic behaviour in that nearby 
speech trajectories exhibit exponential divergence on the average 

(4) A correlation dimension analysis ot speech shows that it is largely a low di- 
mensional signal We have also computed the correlation dimension trom a 
simplified statistical model of a particular vowel utterance The "dimension" 
attribute of a time series is helpful in a state space modelling exercise because 
it gives the necessary and sufficient number of state space variables needed to 
model the data We have qualified the dimension results with a study of the 
various sources of error affecting the estimates 

(5) Metric entropy is another dynamical invariant which quantifies the average rate 
of loss of information about the state of a dynamical system as it evolves in time 
For completely predictable systems, K = 0, while for chaotic systems, 0 < AT < oo 
Ideal random behaviour is characterized by AT = oo The relevance of metric 
entropy m the context of state space modelling is that it is inversely proportional 
to the average time over which a dynamical system (or a time series model) can 
be predicted from a given initial condition We have computed the second order 
dynamical entropy of speech which is a lower bound of the metric entropy The 
positive value of the second order entropy and the largest Lyapunov exponent 
both give evidence of the average divergence of nearby speech trajectories in 
the reconstructed state space. This observation allows us to distinguish speech 
signals from strictly periodic or quasiperiodic behaviour 
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Based on the dynamical analysis results we have investigated some nonlinear 

prediction schemes for speech signal modelling and coding in a state space framework 

The results of this study are summarized m the followmg points: 

(1) As a first choice, we have studied the (quadratic) polynomial representation 
form This is because it is a simple extension of a linear model and the optimal 
set of coefficients, m the sense of minimum m s e., can be obtamed by solving a 
set of simultaneous linear equations The basis of comparison with short term 
Imear predichon (LP) m terms of segmental SNR is the number of coefficients in 
the two predictor models We have principally considered two ordermg schemes 
for selection of model terms of the quadratic predictor In the first method, we 
exhaust all possible terms upto a certain time lag before considering terms which 
include signal dependence upto greater lags The second method is based on 
orthogonal term search from a set of candidate terms While the first method 
does not perform as well as a LP scheme in terms of segmental prediction gam, 
the second method of terms arrangement gives a modest improvement over 
LP for the same number of coefficients It is worth nohng that in a similar 
study of quadratic predictors [139], the basis of comparison with LP is the hme 
delay upto which signal correlations are considered rather than the number of 
model coefficients In this case, the quadratic predictor performs significantly 
better than a LP Another reported observation m this case is that the short term 
quadratic predictor is also capable of modelhng the pitch period redundancy to 
a great extent 

(2) We also studied the prediction properties of a local prediction scheme in state 
space In this scheme, the representation form is optimized over a local volume 
m the state space where the predichon is to be done The Local State Predichon 
(LSP) scheme is studied in terms of segmental predichon gam, plots of the 
prediction error sequence, their spectrum and autocorrelahon function and is 
compared with the error sequence resulhng from (i) short term LP, and (ii) short 
term plus long term LP In LSB an appropriately chosen neighbourhood will 
contain trajectory points that are close to the "target" vector m time as well as 
those which are approximately an integral number of pitch periods away Thus, 
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a LSP attempts to simulate the functions ot both short term and long term linear 
prediction simultaneously The performance of a local linear prediction scheme 
can be broadly categorized as lying between a short term and short term plus 
long term LP m the above terms It must be noted that this comparison is unfair 
to the LSP because a LSP uses previous speech values to predict the present 
whereas in LP, the model coefficients are optimally estimated tor a block of data 
and thereafter used tor prediction within the block 

(3) We also propose a framework for low to medium delay speech coding in the 
medium bit rate range based on LSP It is an analysis - by - synthesis coder 
structurally similar to CELP and named as a Vector Excited Local State Prediction 
(VELSP) coder The following points highlight the coding scheme and bring out 
the differences with CELP 

(i) LSP is used instead of LP The LSP is performed using previous reproduced 
speech which is available to the decoder as well 

(ii) A single excitation codebook designed from empirical data is used instead 
of two separate codebooks as in CELP taking advantage ot the predichon 
property of LSP 

(ill) A LSP based coder is naturally suited to low delay coding 

(iv) Since a LSP based coder is nonlinear, we give a method to incorporate the 
gain factor. 

We have implemented and studied a basic form of the coder structure at 52, 
6 5 and 8.0 kb/s which require excitation sequences of lengths of 20, 16 and 13 
samples respectively. While the segmental SNR performance is similar to CELP, the 
reproduced speech has perceptible noise and becomes poor at the 5 2 kb/s rate 

We also propose an algorithm for the computation of a lower bound ot the rate 
distortion function for stationary ergodic sources with memory Both discrete and 
continuous alphabet sources are considered The lower bound is more tractable than 
the rate distortion function itself The difficult part in computing the lower bound 
IS in the estimation of the entropy rate / differential entropy rate using histogram 
technique. This is because of an exponential increase in the data length requirement 
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as statistical dependence for larger time frames is successively considered Specifically, 
we have given an algorithm for the computation of the second order entropy rate 
for discrete alphabet sources using the correlation sum technique which is a lower 
bound of its entropy rate. The algorithm is then extended to compute the second 
order differential entropy rate of a continuous alphabet source which is a lower 
bound of its differential entropy rate Examples are given to show the efficacy of 
the estimation scheme Fmally, we compute the lower bound with respect to the 
mean square error distortion cntenon for quantized speech sources. 

Some suggestions for further work with regard to the problems studied by us 
are as follows 

(1) In the first problem of dynamical analysis of speech signals, it will be helpful 
to estimate the entire Lyapunov exponent spectrum rather than just the largest 
Lyapunov exponent While the largest Lyapunov exponent is sufficient to 
determine the average exponenhal divergence of nearby trajectories in state 
space, a computation of the entire spectrum will provide deeper understanding 
of the human speech process and additionally help venfy the metric entropy 
results 

(2) One can study the suitability of other nonlinear representational forms for speech 
prediction and coding in terms of predichon performance and computahonal 
complexity 

(3) We have implemented and studied a skeletal structure of our proposed VELSP 
coding scheme The two areas that require further investigation are the improve- 
ment m the quality of reproduced speech and the reduction in the computational 
complexity of the coder and decoder 

(4) We have proposed an algorithm for the computation of a lower bound of the rate 
distortion function for stationary ergodic sources with memory This is based 
on the conjecture that the correlation sum technique will give better estimate 
than the histogram technique m the computation of the entropy rate This 
conjecture is based on the intuihve arguments presented in chapter 6 which 
have also been the basis for the computation of generalized dimensions and 
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metric entropy using the generalized correlation sum rather than the histogram 
technique in dynamical systems literature The veracity of these conjectures 
should be investigated mathematically 

The theory of nonlinear dynamical analysis and deterministic state space mod- 
elling of time series has been used for various applications These include the 
study of fluid flows, sunspots, mechanical vibrations, economic data, climatic data 
and weather prediction There has been a considerable interest in the recent past 
to analyse and model biomedical signals such as EEC and KC'G using the tools ot 
nonlinear dynamics This intuitively appears to be an appropriate paradigm for the 
investigation of biomedical signals which are essentially observables of "complex" dy- 
namical systems While several preliminary lesults are available, this is a potentially 
interesting avenue of research 
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The purpose of this appendix is to briefly review dynamical systems terminology 
used in the thesis. For this, we have drawn from [34],[35],[36], [37],[38],[72] and 
direct the reader to these references for more elaborate discussion. 

Dynamical systems are described both m terms of the real space and arbi- 
trary manifolds In the real case, an nth-order autonomous dynamical system with 
continuous hme evolution is defined by the state equahon 

s = g(s), s(fo) = So (A 1) 

where s G i?" is the state at time f, and g i?" i?” is called the vector field, g 

associates a tangent vector with each point m the state space A parhcular solution 
of (A 1) with inihal condihon So at time t = to is called a trajectory and is denoted by 

So) The mapping is called the flow of the system In the discrete 

time case, the dynamical system is defined by a map and the flow is 

given by where n indexes the discrete time 

Dynamical systems can also be described in terms of manifolds Let the state 
space of a dynamical system be a compact manifold M. A dynamical system on M is 
a map M M (discrete time) or a vector field (continuous time) The vector field 
associates each point in the manifold with an element of the tangent space at that 
point The time evolution of the dynamical system is given by the flow (f>{t,so) The 
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process of observing the time evolution ot a dynamical system induces an observable 
which IS a "smooth" function h M R 

We are interested in classifying the steady state behaviour ot dynamical systems 
through invariants In particular, the steady state refers to the asymptotic behaviour 
as t 00 A point p is a limit point ot s it <:>(f.s) converges to p as t > do The set 
of all limit points of s is called the limit st-f, L(s), ot s A set L is invariant under 
It, for all s G L and all f, o(s) € L Limit sets are closed and invariant under (j) A 
limit set L IS an attracting limit set or attractor it there exists an open neighbourhood 
U oil such that d(Ls) € b' V t and o(f,s) . L as f - dc The domain or basin of 
attraction, B{L), is the set of all initial conditions so satisfying the above definition 
The classification ot dynamical systems into dissipative and conservative systems 
comes from physics This is an extension ot the notion ot energy to a more general 
context A dynamical system in which state spac e volumes are asymptotically invariant 
under the dynamics is conservative Any dynamical system that is not conservative is 
dissipative Conservative dynamical systems do not have attractors For dissipative 
dynamical systems, there are two possibilities in the steady state either the trajectory 
IS unstable and goes to infinity or it remains bounded, and approaches an attractor 
Nonattracting limit sets cannot be observed in physical systems or simulations 
Four different types of steady state behaviour ot dynamical systems are 

1. Equilibrium point or fixed point: An equilibrium point s, ot a dynamical system 
(A 1) IS a constant solution, (j){t,Se), for all t The limit set tor an equilibrium point 
is the point itself 

2. Periodic solution: Sp) is a periodic solution if 

Sp) = </)(t -H T , Sp) 

for all t and some minimal period T > i) A periodic solution is isolated if 3 a 
neighbourhood of it that contains no other periodic solution An isolated periodic 
solution Sp) IS called a limit cycle, and can occur only m nonlinear systems The 
limit set of a limit cycle is the closed curve traced by (P{t,Sp) over one period, and 
is diffeomorphic to a circle 



Appendix A Dynamical Systems Terminology 


217 


3. Quasiperiodic solution: A quasiperiodic solution is a sum of periodic functions each 
wifri minimal period T, and frequency /, = 1/T,. Also, there exists a finite set of base 
frequencies which are Imearly independent and form a finite integer base for the 
frequencies /, A quasiperiodic soluhon with p base frequencies is called p-periodic 
It possesses a limit set that is diffeomorphic to a p-torus. 

4. Chaos: We will restnct to a qualitative discussion only. Chaos is the bounded, 
aperiodic soluhon of a dynamical system It is a result of the twin properhes of 
stretching and folding Stretching, which is a local property, causes the exponential 
divergence of nearby trajectories This is also referred to as sensitive dependence on 
initial conditions Given two imhal condihons arbitrarily close to one another, the 
corresponding trajectories diverge at a rate characterishc of the system Also, since 
the motion is bounded in the state space, the trajectories exhibit folding Thus, 
trajectories are locally unstable since nearby trajectories separate exponentially but at 
the same hme, they are globally stable, smce they evolve on an invariant set The 
mvariant limit set is not a simple geometrical object like a circle or a torus, but is 
related to complicated, self-similar structure called fractals, which are characterized 
by noninteger dimension Chaotic trajectories relax to strange attractors Strange 
attractors are attractors with the addihonal property of sensihve dependence on 
inihal condihon 
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The experimental results documented in the thesis .11 e based on either ot two types 
of speech databases The first database is based on phoneme articulations ot the 
International Phonetic Alphabet and has been predominantly used m the dynamical 
analysis work of chapters 2 and 3 I he second is based on phoneme specific 
sentence utterances and has been used tor the predictive modelling, coding and rate 
distortion function estimation results ot chapteis 4, 5 and b VVe elaborate on these 
two databases in the following two sections 

B.l Database 1 — Phoneme Articulations 

In any language, most of the sounds can be grouped into families, each family 
consisting of an important sound ot the language together with other related sounds 
which represent it in particular sequences or under particular conditions of length, 
stress or intonation It is to such a family that the term phoneme is normally applied 
The sounds included in it are termed as its memheti, or allophoneb 

There are a variety of basis for the study ot phonetics eg linguistics, acoustic 
theory of speech production etc [8], [134] In the latter category, phonemes can 
broadly be classified as either a continuant or noncontinuant sound Continuant 
sounds are produced by a fixed vocal tract configuration excited by an appropriate 
source, eg vowels, voiced and unvoiced fricatives, nasals etc The remaining 
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sounds such as plosives, stops etc. which are produced by a changing vocal tract 
configuration are termed as noncontinuants 

The International Phonetic Alphabet (IPA), which is periodically revised, is often 
used for reporting expenmental work based on phonemes A simple classification 
scheme for consonants is known as "classification by place and manner" "Place" 
stands in for "place of production" meaning thereby the pomt of articulation of the 
segment "Manner" is the short form for "manner of production" which primarily 
means the type of stricture made by the articulation process to produce the utterance 
The interested reader is referred to [135] for a detailed descnption of this classification 
scheme for phonemes Table B 1 fists the 57 consonants of the IPA by place and 
manner of articulation and 8 cardinal vowels which together form our speech 
database 1 Wherever two phonemes are given the same identification, the latter 
symbol represents a voiced articulation. 

The speech database consists of the recordings of 57 consonant articulations by 
three male and one female speaker corresponding to the IPA chart (revised upto 1989) 
and cardmal vowel articulations, spoken 4 times each by one person (Daniel Jones) t 
Since all the phonemes cannot be spoken in isolation, one associates a phonetic context 
to mean the sounds next to it or near it m which the phoneme is a part All the 
consonant articulations of the database are followed by /a/ while the cardinal vowel 
articulations are sustamed utterances Each of these utterances were lowpass filtered 
at 7 5 kHz and sampled at 16 kHz at 12 bit resolution The consonant articulations 
were isolated from the utterances by a visual inspection of the corresponding time 
series Since plosive utterances are transient m nature and necessarily of extremely 
short duration, we have excluded them from the above isolation process The 13 
plosives of the IPA have therefore not been used for any analysis work reported in 
the thesis Thus, a total of 208 phoneme segments comprising of 44 IPA consonants 
spoken by 4 persons and 8 cardinal vowels spoken 4 times by one person have been 
used in the experiments 


t The database was provided by the Phonetics and Spoken English Department 
of the Central Institute of English and Foreign Languages, Hyderabad, India. 
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S.No. 

Phoneme Type 

Symbol 

1 

Bilabial Plosive 

P 

2 ; 

Bilabial Plosive 

b 

3 

Alveolar Plosive 

t 

4 

Alveolar Plosive 

J 

5 

Retroflex Plosive 

f 

6 

Retroflex Plosive 

i 

7 

Palatal Plosive 

c 

8 

Palatal Plosive 

4 ^ 

9 

Velar Plosive 

k 

10 

Velar Plosive 

'J 

11 

Uvular Plosive 

d 

12 

Uvular Plosive 

X 

q 

13 

Glottal Plosive 

? 

14 

Bilabial Nasal 

m 

15 

Labiodental Nasal 

0 ) 

16 

Alveolar Nasal 

n 

17 

Retroflex Nasal 

a 

18 

Palatal Nasal 

p 

19 

Velar Nasal 

1 

20 

Uvular Nasal 

N 

21 

Alveolar Trill 

r 

22 

Uvular Trill 

R 

23 

Alveolar Tap 

r 

24 

Retroflex Flap 


25 

Bilabial Fricative 


26 

Bilabial Fricative 


27 

Labiodental Fncative 

-f 

28 

Labiodental Fricative 

V 

29 

Dental Fncative 

G 

30 

Dental Fricative 


31 

Alveolar Fricative 

s 

32 

Alveolar Fricative 

z 

33 

Postalveolar Fricative 

i 

34 

Postalveolar Fricative 

3 

35 

Retroflex Fricative 

5 

36 

Retroflex Fricative 


37 

Palatal Fricative 
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S.No. 

Phoneme Type 

Symbol 

38 

Palatal Fricative 

1 

39 

Velar Fricative 

X 

40 

Velar Fricative 

y 

41 

Uvular Fricative 

X 

42 

Uvular Fricative 


43 

Pharyngeal Fricative 

ti 

44 

Pharyngeal Fricative 

q 

45 

Glottal Fricative 

h 

46 

Glottal Fricative 

- 

fi 

47 

Alveolar Lateral Fricative 


48 

Alveolar Lateral Fricative 

h 

49 

Labiodental Approximant 

u 

50 

Alveolar Approximant 

j 

51 

Retroflex Approximant 

4 

52 

Palatal Approximant 

1 

53 

Velar Approximant 

\J 

54 

Alveolar Lateral Approximant 

1 

55 

Retroflex Lateral Approximant 

1 

56 

Palatal Lateral Approximant 

C 

57 

Velar Lateral Approximant 

A 

L 

58 

Cardinal Vowel 

• 

L 

59 

Cardinal Vowel 

e 

60 

Cardinal Vowel 

£ 

61 

Cardinal Vowel 

a 

62 

Cardinal Vowel 

a 

63 

Cardinal Vowel 

0 

64 

Cardinal Vowel 

o 

65 

Cardinal Vowel 

u 


Table B 1: List of phoneme articulations of speech database 1 The consonants are 
those of the International Phonetic Alphabet 
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B.2 Database 2 — Phoneme Specific Sentences 

A set of phoneme specific sentences were proposed by Huggins and Nickerson [136] 
for the purpose of evaluation of the degradations of processed speech to specific 
phoneme types They proposed a set of four such sentences from a larger set These 
are as follows. 

(a) Why were you away a year, Roy'^ 

(b) Nanny may know my meaning 

(c) His vicious father has seizures 

(d) Which tea party did Baker go to^ 

These sentences are phoneme specific in that they concentrate all phonetic segments 
with similar acoustic properties in a single sentence and exclude as tar as possible all 
phonetic segments with dissimilar properties The above sentences include among 
them all the consonants of English except /B/, N and /)/ The first sentence contains 
only vowels and glides These sounds are characterized by an all pole spectra, which 
change slowly, and contain no abrupt changes in the amplitude level All the sounds 
are voiced and only slow changes of pitch occur 

The second sentence contains only (nasalized) vowels and nasals Like the first 
sentence, it is voiced throughout, and its spectrum and amplitude level change 
relatively slowly Both nasals and nasalized vowels contain zeros m their spectra 

The third sentence contains only voiced and unvoiced fricatives besides vowels 
Fricatives contain zeros in their spectra but the spectra themselves are very different 
from those of voiced sounds due to the noise excitation In this sentence utterance, 
the rates of amplitude change are still relatively low 

The fourth sentence contains (vowels and) all stops and affricates except l]l The 
spectrum and amplitude of the speech waveform change frequently and abruptly, 
and there are many voiced-unvoiced transitions 

The speech database composes of PCM recordings of the 4 sentences spoken by 
6 male and 6 female speakers t According to the information made available with 

t This database was provided by the AT & T Bell Laboratories, Murray Hill, N ] , 
USA 
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the database, the recorded speech was lowpass filtered at 3 9 kHz at the time ot 
recordmg Subsequently, they were digitally filtered with an FIR filter having a flat 
response between 125 and 3500 Hz and a roll-off of -5 dB/lOOHz above 3500 Hz 
The speech signal was sampled at 8 kHz at a resolution of 16 bits/sample The total 
duration of digitized speech is 120s 
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